PREFACE


This  volume  contains  several  papers  describing  the  current  status  of  the 
Hearsay  II  system.  Some  of  these  papers  have  appeared  elsewhere  in  the  literature 
and are included here for convenience. The relevant references appear in the table of
contents.

Hearsay II has been the major research effort of the CMU speech group over
the past three years. During this period, solutions were devised to many difficult
conceptual problems that arose during the implementation of Hearsay I and other
earlier efforts. The result represents not only an interesting system design for speech
understanding but also an experiment in the area of knowledge-based systems
architecture. Attempts are being made by other AI groups to use this type of systems
architecture in image understanding and other knowledge-intensive systems.

The  Hearsay  II  effort  is  headed  by  Lee  Erman.  The  other  major  participants 
include  Frederick  Hayes-Roth,  Victor  Lesser,  and  Linda  Shockey.  The  Hearsay  II  group 
consists of seven full-time researchers and four graduate students. Several others in the
CMU  environment  participate  in  the  effort.  The  group  is  loosely  organized  into 
subgroups: 

System  design  (Lesser,  Gill,  McKeown,  Erman) 

Focus  of  attention  and  policy  (Hayes-Roth,  Lesser) 

Syntax,  Semantics,  and  Discourse  (Hayes-Roth,  Mostow,  Gill) 

Word  hypothesization  and  Verification  (Erman,  Cronk,  Smith) 

Phonetics  (Shockey,  Adam) 

Segmentation  and  Labeling  (Goldberg,  Reddy) 

Performance  analysis  (Erman,  Reddy,  Masulis) 

Data  base  (Shockey) 

The  various  papers  appearing  in  this  volume  are  representative  of  the  present 
state  of  conceptualization  and  implementation  of  the  system.  The  first  paper  provides 
an  overview  of  much  of  this  research.  The  system  has  been  operational  for  some  time 
and  most  of  the  effort  at  present  involves  performance  analysis,  acquisition  and 
representation  of  additional  knowledge,  and  experimentation  with  different  control 
strategies. 
Overview of the Hearsay
Speech  Understanding  Research 

Lee  D.  Erman 

Hearsay is the generic name for much of the
speech understanding research in the computer
science department at Carnegie-Mellon University
(CMU). The major goals of this research include the
investigation of computer knowledge-based
problem-solving systems and the practical
implementation of speech input to computers. An emphasis of
this effort is the design of system structures for
efficient implementation of such systems.

We will first describe the problem of speech
understanding and (in Section 2) present the context
of the Hearsay effort. Section 3 describes the
Hearsay model and implementation philosophy.
Then, in Section 4, Hearsay I is described, including
some major design limitations which formed much
of the motivation for Hearsay II, described in Section
5.

1. The Problem of Speech Understanding

In  order  to  provide  a framework  for  discussion,  a 
conceptual  model  of  speech  communication  is 
presented: 

1) The purpose of a speech utterance is to
transmit information from the speaker to the listener.

2)  The  speaker  starts  with  some  deep  semantic 
representation  of  the  message.  Several  kinds  of 
transformations  are  applied  to  this  representation 
(syntactic,  linguistic,  phonological,  neurological, 
articulatory, acoustic, etc.). The result is an
acoustic  signal. 

3) The acoustic signal is detected by the listener.
The listener applies transformations which are
similar (though inverse) to those of the speaker;
the result is some semantic representation for the
listener.

4) The correctness or effectiveness of the
transmission is related to the correspondence
between the meaning that the speaker intends and
the meaning derived by the listener; it is measured
in behavioral terms, i.e., what actions of the
listener are triggered by receipt of the message.
(Very often this behavior takes the form of an
utterance generated by the original listener.)

The  goal  of  automatic  speech  understanding  is  to 
produce  a machine  (usually  in  the  form  of  a com- 
puter program)  which  can  effectively  perform  as  the 
listener. 

The problem of understanding speech with the
competence of a human is formidable. A reasonable
plan is to approach the most general kinds of
solutions by designing and building a sequence of
systems, each of which is more ambitious than the
previous. There are many dimensions along which
to move to provide this graded sequence (e.g.,
requirements of vocabulary size, speed of response,
accuracy, number of speakers). A way of capturing
these various dimensions is the concept of a task:
a well-defined domain within which the machine is to
perform some functions. For example, the task
might be to answer the user's (speaker's) questions
about airline flight schedules or to provide an
interactive computer-programming facility. In defining
a task, one important aspect is the spoken input
language. This language is pre-specified lexically,
syntactically, and semantically; that is, descriptions
are given of the words, how they may be sequenced
to form sentences, and the meaning of the sentences
in the context of the task (see footnote 1).

There  are  two  major  aspects  of  the  speech  com- 
munication process  which  generate  most  of  the 
problems  in  machine  understanding. 

1) The nature of the speech signal: The
transformations involved in speech production are
many and complex, and they strongly interact
with each other. The result is a very large amount
of variability in the signal which conveys little or
no meaning, i.e., which is noise in the context of
the  speech  understanding  task.  Repetitions  of  the 
"same"  utterance,  spoken  by  one  speaker  under 
unchanging  conditions  just  seconds  apart  often 


1 This  use  of  a task  to  constrain  the  problem  is  not 
as  artificial  as  it  may  first  appear.  Usually  human 
speech understanding is also performed in
"constrained domains": in almost any given situation
only  a small  subset  of  all  possible  messages  is 
likely. 
result in significant variation of the signal. As the
various conditions (e.g., identity, age, gender,
emotional state, and environment of the speaker)
are relaxed and allowed to change, this variability
increases significantly. Further, strong
interactions occur among the various elements; that is,
words, phones, phrases, etc. influence and
modify nearby words, phones, phrases, etc., and
thus have differing manifestations in different
contexts (see footnote 2).

2) The nature of our knowledge of the
transformations: Theories which attempt to explain the
production of speech are, in general, incomplete and
inadequate in explaining the phenomena with a
great deal of accuracy. Also, it is often difficult
to translate existing theories into the framework of
feasible recognition algorithms.

Largely  because  of  these  two  aspects,  the  kinds  of 
machine  speech  understanding  systems  developed 
can  be  characterized  as  having  several  interesting 
and  problem-laden  features: 

1) The system must make use of multiple and
diverse sources of knowledge to solve the
problem (e.g., acoustic-phonetics, phonology, syntax,
semantics, pragmatics); these knowledge sources
(KSs) correspond to the different kinds of
transformations that generate the speech signal.
Designing an effective control structure for these
many diverse KSs is crucial and difficult.

2) Each source of knowledge is incomplete and
errorful. Thus, although it is used in an attempt to
further the recognition of an utterance, each KS
will also introduce errors into the analysis process.
The different sources must work to correct each
other's mistakes in order to keep errors from
propagating excessively.

3) The systems developed tend to be large and
complex. Building, debugging, understanding,
and evaluating them is difficult. In particular, many
researchers  need  to  interact  with  the  system  over 
a period  of  several  years,  both  experimenting  with 
its  operation  and  modifying  it.  An  important 
aspect  of  system  modification  is  the  ability  to 
modify  and  replace  individual  KSs. 

4)  Because  of  the  effectiveness  and  apparent  ease 
of  human  performance  in  the  speech  understand- 
ing task,  a useful  solution  to  the  problem  must  be 
a system  which  approaches  that  performance, 
primarily  in  terms  of  speed,  accuracy,  and, 
ultimately, economy.


2 We are concerned here with connected speech
input, as opposed to isolated word systems in
which the words (or short phrases treated as
indivisible units) are spoken individually.


5) Because the systems tend to be highly
experimental, they must be exercised often and over
substantial amounts of trial data. The
performance of the system while under development
(particularly in terms of speed of execution) is an
important factor in determining how much
experimentation can occur. Thus, issues of
performance are crucial even in the development stage.

6) Because the systems are complex and
experimental, the interface through which the researcher
controls and interacts with the system is crucial.
The researcher must be able to interact with the
system flexibly and at the functional level of the
system (in addition to the more traditional
machine language and programming language
levels).

This has been a short introduction to the
problems of developing speech understanding systems.
A more complete analysis of the problem, including
pointers to the relevant literature, can be found in
Newell et al. [71].

2.  Context  of  This  Work 

Hearsay's direct lineage can be traced back ten
years. The work of Reddy and Reddy & Vicens at
Stanford University (Reddy [66]; Reddy and Vicens
[68]; Vicens [69]) resulted in extending the
state-of-the-art of isolated word recognition systems (e.g.,
91% accuracy on a 561-word lexicon in ten times
real-time on a PDP10 and with live input). This
system differed from most earlier ones, which were
essentially pattern classifiers, in that it contained a
substantial amount of speech knowledge and it used
extensive heuristics in applying the knowledge to
prune the search space. In addition, one version of
the system was created which used syntactic
constraints and operated on connected speech
(although in a very ad hoc and unextendable manner).

The Hearsay model for speech understanding was
developed at CMU during 1970-1971 (Reddy, Erman,
and Neely [70]; Reddy [71]; Reddy, Erman, and
Neely [72]). This model faced the problems of
speech understanding (i.e., in a task domain) and
connected speech. The Hearsay I system was
designed and built as an implementation of this
model (Reddy, Erman, and Neely [73]; Reddy, Erman,
Fennell, and Neely [73]; Neely [73]; Erman
[74]). This system, which was the first demonstrable
live system to handle non-trivial connected speech,
became operational in June, 1972, and has been
since augmented and studied (Lowerre [75]).
Although a number of simplifying assumptions were
made in implementing the model, Hearsay I does
address the problems of connected speech and of the
role and interactions of different kinds of knowledge.
By exhibiting a successfully working system which is
based on a model and by providing a set of solutions
to these problems (even if some of the solutions are
known to be far from optimal), Hearsay I clarified the
problems and serves as a basis, and encouragement,
for subsequent work.

The experience of building and experimenting
with Hearsay I, together with the other research in the
field, led to a design review which resulted in the
Hearsay II system (Lesser, Fennell, Erman, and
Reddy [74]). Hearsay II is also based on the Hearsay
model; it generalizes and extends many of the
concepts which exist in a more simplified form in the
Hearsay I system.

Concurrent with the early stages of the Hearsay
development, a group was formed by the Advanced
Research Projects Agency (ARPA) to study the
feasibility of developing speech understanding
systems. This group, which included researchers
active in artificial intelligence as well as those in more
traditional directions of speech recognition
research, produced its report in May, 1971. This
report (Newell et al. [71]) provides a comprehensive
and detailed analysis of the problems involved. Part
of this study included the specification of a set of
nineteen dimensions for describing the capabilities
of a speech understanding system; the first column
of Figure 1 summarizes those dimensions.

On  recommendation  of  the  study  group,  a five- 
year  ARPA  Speech  Understanding  Research  effort 
was  launched  in  October,  1971.  An  innovative  plan 
with  five  principal  contractors  (including  CMU)  was 
chosen;  each  was  to  aim  to  produce  a complete 
system  meeting  a set  of  specifications  laid  out  by  the 
study  group  (the  second  column  of  Figure  1)  and  all 
were  to  interact,  exchanging  ideas  and  data. 
Although  charged  to  meet  the  same  set  of 
specifications,  each  group  was  free  to  choose  its 
own orientation (and task domains). Thus, the flavor
of each of the systems reflects the particular expertise
and motivations of the people involved.

The Hearsay research represents CMU's major
efforts to meet the ARPA specifications; in particular, it
is hoped that Hearsay II will accomplish that goal. In
addition, several other systems are being
experimented with, also aiming to meet these goals:
a version of the Dragon system (Baker [75]) being
extended by Reddy and Lowerre, and a combination
of Hearsay I and Dragon (Lowerre [75]).

In this paper we will describe only the Hearsay
effort. An IEEE symposium on speech recognition was
held at CMU in April, 1974, at which most workers in
the field were represented. The contributed and
invited papers from that symposium (Erman [74b];
Reddy [75]) provide a comprehensive description of
the state-of-the-art at that time.


3.  The  Hearsay  Model  and  Implementation 
Philosophy 

This section describes a general model of speech
understanding, the Hearsay model, and some of
the problems implied by that model. The following
two sections provide overviews of the Hearsay I and
Hearsay II implementations of that model.

As one knowledge source (KS) makes errors and
creates ambiguities, other KSs must be brought to
bear to correct and clarify those actions. This KS
cooperation should occur as soon as possible after
the introduction of an error or ambiguity in order to
limit its ramifications. The mechanism used for
providing this high degree of cooperation is the
hypothesize-and-test paradigm. In this paradigm,
solution-finding is viewed as an iterative process.
Two kinds of KS actions occur: a) the creation of an
hypothesis, an "educated guess" about some aspect
of the problem, and b) tests of the plausibility of the
hypothesis. For both of these steps, the KS uses a
priori knowledge about the problem, as well as the
previously generated hypotheses. This "iterative
guess-building" terminates when a consistent
subset of hypotheses is generated which satisfies some
specified requirements for an overall solution.

As  a strategy  for  developing  such  systems,  one 
needs  the  ability  to  add  and  replace  KSs  and  to 
explore  different  control  strategies.  Thus,  such 
changes  must  be  relatively  easy  to  accomplish;  there 
must  also  be  ways  to  evaluate  the  performance  of 
the  system  in  general  and  the  roles  of  the  various 
KSs  and  control  strategies  in  particular.  This  ability 
to experiment conveniently with the system is crucial
if the amount of knowledge is large and many people
are needed to introduce and validate it. One means
of helping to provide these flexibilities is to require
that KSs be independent; i.e., the explicit interactions
between KSs and their assumptions about
each other must be minimal.

Besides providing for the modification and
evaluation of KSs, decomposition of the system into
relatively independent KSs also facilitates its
implementation on an asynchronous multi-processor
machine. Such configurations seem increasingly
attractive as cost-effective ways of obtaining large
amounts of computing power. One problem that has
limited the development and usage of such
machines is the difficulty of decomposing large
problems for such machines. Erman, Fennell,
Lesser, and Reddy [73] describe this problem and
outline some early solutions in the Hearsay context;
Lesser [75] provides a survey of this subject.

The basic view of development of a speech
understanding system includes a strong component of
experimentation: one needs to build a system and
Figure 1: Dimensions of Speech Understanding Systems and ARPA Specifications for 1976.
(After Newell et al. [71].)

Dimensions and Examples                               ARPA Specifications for 1976 Systems
                                                      The system should:

(1) Manner of Speech                                  (1) accept connected speech
    connected? isolated words?

(2) Number of Speakers                                (2) from many
    one? small set? open population?

(3) Dialect and Manner                                (3) cooperative speakers of the general American
    cooperative? casual? single gender?                   dialect,
    both genders? children? what dialect(s)?

(4) Environmental Conditions                          (4) in a quiet room
    quiet room? computer room? factory?
    public place?

(5) Transducer                                        (5) over a good quality microphone
    high quality microphone? telephone?

(6) Speaker Tuning                                    (6) allowing slight tuning of the system per speaker,
    few sentences? paragraphs? full vocabulary?

(7) Speaker Training                                  (7) but requiring only natural adaptation by the
    natural adaptation? elaborate?                        user,

(8) Vocabulary Size and Selection                     (8) permitting a slightly selected vocabulary of
    50? 200? 1,000? 10,000?                               1,000 words,
    preselected? selective rejection? free?

(9) Grammar                                           (9) with a highly artificial syntax,
    fixed phrases? artificial language?
    free English? adaptable?

(10) Task                                             (10) and a task with a constrained and fairly simple
    highly constrained (e.g., simple retrieval)?           semantics,
    focussed (e.g., numerical algorithms)? open?

(11) User Model                                       (11) with a simple psychological model of the user,
    nothing? current knowledge about the user?

(12) Mode of Interaction                              (12) providing graceful interaction,
    response only? ask for repetitions?
    explain language? discuss communications?

(13) Error Rate                                       (13) tolerating less than 10% semantic error,
    none (<0.1%)? <10%? >20%?

(14) Response Time                                    (14) in a few times real-time,
    no hurry? few times real-time? immediate?

(15) Processing Power                                 (15)
    1x10^6 instructions/sec? 10 mips? 100 mips?
    1000 mips?

(16) Memory Size                                      (16)
    1 megabit? 10 mb? 100 mb? 1000 mb?

(17) System Organization                              (17)
    simple program? multiprocessing? parallel
    processing? unidirectional processing?
    feedback? backtrack? planning?

(18) Cost                                             (18)
    $0.001/sec. of speech? $0.01/s? $0.1/s?
    $1.0/s?

(19) Operational Date                                 (19) and have a prototype demonstrable in 1976.

then manipulate and study it. In order to provide an
environment to accomplish this, a two-level
approach is taken. First, a basic set of facilities is
provided, and second, various configurations are
built using these facilities. These facilities, which
together are called the kernel, form a
problem-dependent programming system for building and
experimenting with particular configurations. The
correct choice of kernel facilities and their
implementation are crucial ingredients in developing a
system.

The Blackboard -- Representation of Knowledge

The requirement that KSs be independent implies
that the functioning (and very existence) of each
must not be necessary or crucial to the others. On
the other hand, the KSs are required to cooperate in
the iterative guess-building, using and correcting
one another's guesses; this implies that there must
be interaction among the processes. These two
opposing requirements have led to a design in which
each KS interfaces to the others externally in a
uniform way that is identical across KSs and in which
no knowledge source knows what or how many other
KSs exist. The interface is implemented as a
dynamic global data structure, called the
blackboard (see footnote 3). The primary units in the blackboard
are the guesses about particular aspects of the
problem: the hypotheses. At any time, the
blackboard holds the current state of the system; it
contains all the guesses about the problem that
exist. Subsets of hypotheses represent partial
solutions to the entire problem; these may compete
with the partial solutions represented by other
(perhaps overlapping) subsets.

Each  KS  may  access  information  in  the 
blackboard.  Each  may  add  information  to  the 
blackboard  by  creating  (or  deleting)  hypotheses,  by 
modifying  existing  hypotheses,  and  by  establishing 
or  modifying  explicit  structural  relationships  among 
hypotheses.  The  generation  and  modification  of 
globally  accessible  hypotheses  is  the  exclusive 
means of communication among the diverse KSs.
This mechanism of cooperation, which is an
implementation of the hypothesize-and-test paradigm,
allows  a KS  to  contribute  knowledge  without  being 
aware  of  which  other  KSs  will  use  the  information  or 
which  KS  supplied  the  information  that  it  used.  It  is  in 
this  way  that  KSs  are  made  independent  and 

3 The term "blackboard" was used by Simon [66] in
describing a mechanism in long-term memory as
part of a theory of the psychology of
problem-solving. Simon [71] further develops this concept
and elaborates its uses in the context of an
abstract model for problem-solving.


separable. The structural relationships form a
network of the hypotheses and are used to represent
the deductions and inferences which caused a KS to
generate one hypothesis from others. The explicit
retention in the blackboard of these dependency
relationships is used to hold, among other things,
competing hypotheses.
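
A minimal sketch of this arrangement follows. It is an illustration under assumptions, not the actual Hearsay data structure: the class names `Hypothesis` and `Blackboard`, the level names "phone" and "word", the ratings, and the two toy KSs are all hypothetical. The point it shows is that each KS reads and writes only the shared structure and never refers to another KS, while the `support` links record which hypotheses a new one was derived from.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    level: str            # e.g. "phone" or "word" (hypothetical level names)
    label: str
    rating: float = 0.0
    support: list = field(default_factory=list)  # hypotheses this one was derived from


class Blackboard:
    """Global store of hypotheses; the only channel between KSs."""

    def __init__(self):
        self.hypotheses = []

    def add(self, level, label, rating=0.0, support=()):
        hyp = Hypothesis(level, label, rating, list(support))
        self.hypotheses.append(hyp)
        return hyp

    def at_level(self, level):
        return [h for h in self.hypotheses if h.level == level]


def word_ks(board):
    """Reads phone hypotheses and writes a word hypothesis linked to them."""
    phones = board.at_level("phone")
    if [p.label for p in phones] == ["k", "ae", "t"]:
        board.add("word", "cat", rating=0.8, support=phones)


def syntax_ks(board):
    """Re-rates word hypotheses without knowing which KS created them."""
    for word in board.at_level("word"):
        if word.label in {"cat", "dog"}:
            word.rating = min(1.0, word.rating + 0.1)


board = Blackboard()
for phone in ["k", "ae", "t"]:
    board.add("phone", phone, rating=0.6)
for ks in (word_ks, syntax_ks):   # neither KS references the other
    ks(board)
```

Because the KSs share nothing but the blackboard, either one could be removed or replaced without editing the other, which is the independence property argued for above.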

Because of the central importance of the
blackboard, its design (i.e., the design of the
structure of hypotheses and their relationships) is crucial.
This is usually called the problem of representation.

Activation of Knowledge Sources -- Focus of Attention

An  action  of  a KS  in  the  blackboard  takes  place  in 
the  context  of  some  hypotheses  already  existing  in 
the blackboard. For example, a KS which
hypothesizes  words  may  require  a stressed  vowel 
(as  well  as  some  surrounding  sounds)  as  its  context 
in  order  to  consider  generating  new  word 
hypotheses. 

At any time there may be many different contexts
which satisfy the needs of one or more KSs. The
problem of choosing the order for activating KSs on
contexts is generally called the problem of control
flow. Because there may be many such possible
activations and because each activation of a KS will, in
general, create the potential for even more
activations (e.g., the word hypothesizer, given a single
new stressed vowel context, might hypothesize five
new words as competing candidates, and each of these
might provide a new context for a syntactic parser),
the number of possible activations may grow.
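
One common way to picture the control-flow problem is as an agenda of pending activations from which only the most promising are executed. The sketch below is an assumption made for illustration, not the scheduling policy of either Hearsay system: the function `run_agenda`, the `budget` cap, and the toy `expand` and `rate` functions are hypothetical.

```python
import heapq


def run_agenda(initial, expand, rate, budget):
    """Execute at most `budget` activations, best-rated first.

    initial: iterable of starting contexts
    expand:  function(context) -> list of new contexts created by the activation
    rate:    function(context) -> number; higher means more promising
    Returns the executed contexts and the number left pending.
    """
    # heapq is a min-heap, so ratings are negated; the counter breaks ties
    # in favor of the context that was queued first.
    counter = 0
    agenda = []
    for context in initial:
        heapq.heappush(agenda, (-rate(context), counter, context))
        counter += 1

    executed = []
    while agenda and len(executed) < budget:
        _, _, context = heapq.heappop(agenda)
        executed.append(context)
        for new_context in expand(context):
            heapq.heappush(agenda, (-rate(new_context), counter, new_context))
            counter += 1
    return executed, len(agenda)


# Toy example: each context is a string, and activating a context shorter
# than three characters creates two longer contexts.
def expand(context):
    return [context + "a", context + "b"] if len(context) < 3 else []


def rate(context):
    return context.count("a")  # pretend contexts with more "a"s are more promising


executed, pending = run_agenda(["a", "b"], expand, rate, budget=4)
```

With a budget of four activations, the run follows the best-rated chain and leaves the remaining contexts unexplored on the agenda.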

If very, very large amounts of processing power
(and memory) were available, one could consider
actually activating all KSs in all their possible
contexts. This would expand the blackboard with many
(competing) hypotheses. Assuming this would
eventually terminate (i.e., at some point no new contexts
are created), a decision process could then try to
pick from all the competing hypotheses that subset
which best describes the data; this would be the
system's "solution" to the problem. Because of this
combinatoric explosion of possibilities (caused
mostly by the problems of variability and
incompleteness in the signal and errorfulness of the
KSs), this complete expansion is not feasible.
Therefore, the control strategy can pick only a small
subset of the applicable KS activations; this can be
thought of as exploring a limited portion of the
(potential) fully-expanded blackboard. The problem
of choosing a control strategy which can efficiently
reach the correct set of hypotheses is called the
attention-focusing problem. Its solution is also
critical for the success of a system. If portions of the
correct solution are pruned, the solution will never
Hearsay I Performance

The Hearsay I system first demonstrated live.
improving its performance. The system has been
formally tested on a set of 144 connected speech
utterances, containing 676 word tokens, spoken by
five speakers, and consisting of four tasks (only one
of which has had a semantics component
programmed), with vocabularies ranging from 28 to
76 words. The system locates and correctly identifies
about 93% of the words, using all three of its KSs.
Without the use of the semantics KS, the accuracy
decreases to 70%. It decreases further to about 30%
when neither syntax nor semantics are used.
Hearsay I operates in about 7 to 10 times real-time on a
PDP10 KA10 (0.3 million instructions/sec machine),
using about 120K words (36-bits/word) for storage
and programs.

Hearsay I Design  Limitations 

There are four major design decisions in the
Hearsay I implementation of knowledge representation
and cooperation which make it difficult to directly
extend Hearsay I to more ambitious performance goals.
The first, and most important, of these limiting
decisions concerns the use of the
hypothesize-and-test paradigm. As implemented in Hearsay I, the
paradigm is exploited only at the word level. That is,
the information content of any hypothesis in the
blackboard is limited to a description at the word
level. The addition of non-word level KSs (i.e., KSs
cooperating via either sub-word levels, such as
syllables or phones, or via supra-word levels, such
as phrases or concepts) thus becomes cumbersome
because this knowledge must somehow be related
to hypothesizing and testing at the word level.

Secondly, Hearsay I constrains the
hypothesize-and-test paradigm to operate in a lockstep control
sequence. The effect of this decision is to limit
parallelism of execution (and thus reduce
effectiveness on a multi-processor configuration); this is
because the time required to complete a
hypothesize-and-test cycle is the maximum time
required by any single hypothesizer KS plus the
maximum time required by any single verifier
(testing) KS. Another disadvantage of this control
scheme is that the time increases for the system to
refocus attention, because there is no provision for
any communication of partial results among KSs.
Thus, for example, a rejection of a particular option
word by a KS will not be noticed until all the KSs
have tested all the option words.
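
The cost of the lockstep rule can be made concrete with a short calculation. The timings below are invented for illustration and are not measurements of Hearsay I; the function names and the assumption of one processor per participating KS in each phase are likewise hypothetical.

```python
def lockstep_cycle_time(hypothesizer_times, verifier_times):
    """Duration of one lockstep cycle when the KSs in each phase run in
    parallel and every phase waits for its slowest KS."""
    return max(hypothesizer_times) + max(verifier_times)


def idle_time(hypothesizer_times, verifier_times):
    """Processor time spent waiting within each phase, assuming one
    processor per participating KS (an assumption of this sketch)."""
    h_max, v_max = max(hypothesizer_times), max(verifier_times)
    return (sum(h_max - t for t in hypothesizer_times)
            + sum(v_max - t for t in verifier_times))


# Made-up timings in milliseconds for three hypothesizers and three verifiers.
hyp_ms = [40, 10, 25]
ver_ms = [5, 60, 30]

cycle = lockstep_cycle_time(hyp_ms, ver_ms)   # 40 + 60
wasted = idle_time(hyp_ms, ver_ms)            # (0 + 30 + 15) + (55 + 0 + 30)
```

In this example the cycle is governed entirely by the two slowest KSs, and more processor time is spent waiting than the cycle itself lasts, which is the sense in which lockstep control limits the benefit of a multi-processor configuration.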

A third weakness in the Hearsay I implementation
concerns the structure of the blackboard: there is no
provision for specifying relationships among
alternative sentence hypotheses. This absence has the
effect of increasing the overall computation time and
increasing the time to refocus attention, because the
information gained by working on one hypothesis
cannot be shared by propagating it to other relevant
hypotheses.

A fourth limiting design decision relates to how a
global problem-solving strategy is implemented in
Hearsay I. The policies for attention-focusing and
control are embedded in the recognition overlord
module in an ad hoc fashion: there is no coherent
structure for the algorithms and they are wired in
to the kernel of the system, rather than being
available for easy manipulation and
experimentation. Thus it is awkward to modify and evaluate
policy algorithms.

5. Overview of Hearsay II

Hearsay II represents the step following Hearsay I
in  the  sequence  of  increasingly  ambitious  systems 
for  speech  understanding.  The  major  changes  to  the 
system  structure  are  a)  in  the  representation  of 
knowledge  in  the  blackboard  and  b)  in  the  manner  of 
activation  and  attention  focusing  of  KSs. 

The Blackboard of Hearsay II

The blackboard has been extended and
generalized to allow a) the representation of all levels
of information (acoustic, phonetic, syllabic, etc.) in
addition to the lexical and sentence levels of
Hearsay I and b) the explicit representation of
relationships among hypotheses.

The blackboard is partitioned into distinct
information levels; each level is used to hold a different
(and potentially complete) representation of the
utterance. Associated with each level is a set of
primitive elements appropriate for representing the
problem at that level. (For example, the elements at
the lexical level are the words of the vocabulary to be
recognized, while the elements at the phonetic level
are the phones of English.) Each hypothesis exists at
a particular level and is labeled as being a particular
element of the set of primitive elements at that level.
The choice of levels (and the set of elements at each
level) is not prespecified by the kernel of the
system. To the kernel, all levels are uniform, so new
ones can be added at any time. The configuration of
levels that is currently in use is shown in Figure 2
(see footnote 4).

Parametric Level: The parametric level holds the
most basic representation of the utterance that the
system has; it is the only direct input to the
machine about the acoustic signal. Several
different sets of parameters are being used in
Hearsay II interchangeably: 1/3-octave filter-band
energies measured every 10 msec., LPC-derived
vocal-tract parameters, and wide-band energies
and zero-crossing counts.


4 An elaboration of the following description can be
found in Shockey and Erman [74].




Segmental Level: This level represents the
utterance as labeled acoustic segments. Although
the set of labels is phonetic-like, the level is not
intended to be phonetic: the segmentation and
labeling reflect acoustic manifestation and do
not, for example, attempt to compensate for the
context of the segments or attempt to combine
acoustically dissimilar segments into (phonetic)
units.

Phonetic Level: At this level, the utterance is
represented by a phonetic description. This is a
broad phonetic description in that the size
(duration) of the units is on the order of the "size"
of phonemes; it is a fine phonetic description
to the extent that each element is labeled with a
fairly detailed allophonic classification (e.g.,
"stressed, nasalized [I]").

Surface-Phonemic Level: This level, named by
seemingly contradictory terms, represents the
utterance by phoneme-like units, with the addition
of modifiers such as stress and boundary (word,
morpheme, syllable) markings.

Syllabic Level: The unit of representation here is
the syllable.

Lexical Level: The unit of information at this level
is the word.

Phrasal Level: Phrases appear at this level. In
fact, since a level may contain arbitrarily many
"sub-levels" of elements (using "links", as
described below), traditional kinds of syntactic
trees are directly represented here.

The decomposition of the blackboard into distinct
levels of representation can also be thought of as an
a priori framework of a plan for problem-solving.
Each level is a generic stage in the plan. The goal at
each level is to create and validate hypotheses at
that level. For example, the goal at the phonetic level
is a phonetic transcription of the utterance. The
overall goal of the system is to create (using links,
as described below) the most plausible network of
hypotheses that sufficiently covers the levels.
"Plausible" and "sufficient" here refer to the judgment
of the KSs; "covering the levels" means a network
that connects hypotheses which describe the speech
signal (at the parametric level) to hypotheses which
describe the semantic content of the utterance (at
the phrasal level).

Figure 2: The Levels Currently Used in Hearsay II.

The decomposition of the problem space into
more levels than in Hearsay I parallels the desire to
decompose the KSs more finely, yielding more KSs,
each of which is simpler and smaller. The principal
resultant change in the configuration of KSs is that
the single acoustic-phonetic KS of Hearsay I is
decomposed into about six KSs currently in
Hearsay II. Most KSs need to deal with only
one or two levels to apply their knowledge; a KS need not
even be aware of the existence of other levels. Thus,
each KS can be made as simple as its knowledge
allows; its interface to the rest of the system is in
units and concepts which are natural to it. Also, new
levels can be added as new KSs are designed which
need to use them. (For example, the syllabic level
was a fairly late addition to the configuration; only
two KSs needed to be modified when it was added.)
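
The uniform treatment of levels described above can be illustrated with a minimal sketch. It is not the SAIL kernel: the class names, the method names, and the toy element sets are all assumptions made for illustration. The point it shows is that a level is just a named entry with its own element set, so adding one (as happened with the syllabic level) does not disturb the others.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    name: str    # unique name of the hypothesis
    level: str   # level at which the hypothesis exists
    label: str   # one of the primitive elements of that level


class Blackboard:
    """Hypothetical kernel: every level is handled the same way."""

    def __init__(self):
        self.elements = {}    # level name -> set of primitive elements
        self.hypotheses = {}  # level name -> list of Hypothesis

    def add_level(self, level, elements):
        # A new level can be added at any time without touching the others.
        self.elements[level] = set(elements)
        self.hypotheses[level] = []

    def create(self, name, level, label):
        # A hypothesis must be labeled with an element of its own level.
        if label not in self.elements[level]:
            raise ValueError(f"{label!r} is not an element of level {level!r}")
        hyp = Hypothesis(name, level, label)
        self.hypotheses[level].append(hyp)
        return hyp


bb = Blackboard()
bb.add_level("phonetic", {"t", "eh", "l"})   # toy phone set (assumed)
bb.add_level("lexical", {"tell", "me"})      # toy vocabulary (assumed)
bb.add_level("syllabic", {"tel", "mi"})      # a level added later
h = bb.create("H1", "lexical", "tell")
```

Because the kernel in this sketch never names a specific level, only the KSs that read or write the new level would need to change.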

Activation  of  Knowledge  Sources 

A KS is instantiated as a knowledge-source
process whenever the blackboard exhibits
characteristics which satisfy a "precondition" of the
KS. A precondition of a KS is a description of some
partial state of the blackboard which defines when
and where the KS can contribute its knowledge by
modifying the blackboard. A KS carries out these
actions with respect to a particular context, the context
being some arbitrary subset of the previously
generated hypotheses in the blackboard. Thus, new
hypotheses or modifications to existing hypotheses
are constructed from the (static) knowledge of the
KS and the educated guesses made at some
previous time by other KSs.
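
As a rough illustration of precondition-driven instantiation, the following sketch pairs each KS with a precondition that returns the contexts in which it could act. The class, the function names, the dictionary layout of hypotheses, and the "used" flag are assumptions for the example, not the system's actual interfaces.

```python
class KnowledgeSource:
    def __init__(self, name, precondition, action):
        self.name = name
        self.precondition = precondition  # blackboard -> list of contexts
        self.action = action              # (blackboard, context) -> None


def instantiate(blackboard, knowledge_sources):
    """Return one (KS, context) pair for each place a precondition holds."""
    pending = []
    for ks in knowledge_sources:
        for context in ks.precondition(blackboard):
            pending.append((ks, context))
    return pending


# Toy blackboard: a list of hypotheses held as dictionaries.
blackboard = [{"level": "segmental", "label": "AH", "used": False}]


def unused_segments(bb):
    # Precondition: a segmental hypothesis that nothing has built on yet.
    return [h for h in bb if h["level"] == "segmental" and not h["used"]]


def rename_to_phone(bb, segment):
    # Action: create a phonetic hypothesis from the segment (toy renaming).
    bb.append({"level": "phonetic", "label": segment["label"].lower(),
               "used": False})
    segment["used"] = True


ks = KnowledgeSource("toy-phone-synthesizer", unused_segments, rename_to_phone)
for source, context in instantiate(blackboard, [ks]):
    source.action(blackboard, context)
```

After the action runs, the precondition no longer holds for that segment, so a second call to instantiate yields nothing until the blackboard changes again.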

The modifications made by any given KS process
are expected to trigger further KSs by creating new
conditions in the blackboard to which those KSs, in
turn, respond. The structure of a hypothesis is
designed to allow the preconditions of most KSs to
be sensitive to a single, simple change in some
hypothesis (e.g., the creation of a new hypothesis of
a particular type, a change of a rating, or the creation
of a structural link between particular kinds of
hypotheses). Through this data-directed
interpretation of the hypothesize-and-test paradigm, KSs can
also exhibit a high degree of asynchronous activity
and potential parallelism (see footnote 5).

Figure 3: The Current Knowledge Sources in Hearsay II.

As examples of KSs, Figure 3 shows many of the
current set. The levels are indicated by horizontal
lines in the figure and are labeled at the left. The KSs
are indicated by arcs connecting levels; the starting
point(s) of an arc indicates the level(s) of major
"input" for the KS, and the end point indicates the
"output" level where the KS's major actions occur. In
general, the action of most of these particular KSs is
to create links between hypotheses on its input
level(s) and: 1) existing hypotheses on its output
level, if appropriate ones are already there, or 2)
hypotheses that it creates on its output level.

5 One might think of this model for data-directed
activation of KSs as a production system (Newell
[73]) which is executed asynchronously. The
preconditions correspond to the left-hand sides
(conditions) of productions, and the KSs
correspond to the right-hand sides (actions) of the
productions. Conceptually, these left-hand sides
are evaluated continuously. When a precondition
is satisfied, an instantiation of the corresponding
right-hand side of its production is created; this
instantiation is executed at some arbitrary
subsequent time (perhaps subject to instantiation
scheduling constraints).


The Segmenter-Classifier KS uses the parametric
description of the speech signal to produce a
labeled acoustic segmentation. (See Goldberg et
al. [75] for a description of the algorithm used.)
For any portion of the utterance, several possible
alternative segmentations and labels may be
produced.

The Segment Combiner combines similar
adjacent segments into larger units; it is triggered on
each new hypothesis at the segmental level.

The Phone Synthesizer uses labeled acoustic
segments to generate elements at the phonetic
level. This procedure is sometimes a fairly direct
renaming of an hypothesis at the segmental level,
perhaps using the context of adjacent segments.
In other cases, phone synthesis requires the
combining of several segments (e.g., the generation
of [t] from a segment of silence followed by a
segment of aspiration) or the insertion of phones not
indicated directly by the segmentation (e.g.,
hypothesizing the existence of an [l] if a vowel seems
velarized and there is no [l] in the neighborhood).
This KS is triggered whenever a new hypothesis is
created at the segmental level.

The Word Candidate Generator uses phonetic
information (primarily just at stressed locations
and other areas of high phonetic reliability) to
generate word hypotheses. This is accomplished
in a two-stage process with a stop at the syllabic
level, from which lexical retrieval is more effective.
(In fact, there are really two separate KSs here:
one that goes from phones to syllables, and one
that goes from syllables to words.)

The Phoneme Hypothesizer KS is activated
whenever a word hypothesis is created (at the lexical
level) which is not yet supported by hypotheses at
the surface-phonemic level. Its action is to create
one or more sequences at the surface-phonemic
level which represent alternative pronunciations
of the word. (These pronunciations are
prespecified as entries in a dictionary.) It also creates
the syllable hypotheses for the word, if they do not
already exist.

The Phone-Phoneme Synchronizer is triggered
whenever an hypothesis is created at either the
phonetic or the surface-phonemic level. This KS
attempts to link up the new hypothesis with
hypotheses at the other level. This linking may be
many-to-one in either direction.

The Syntactic-Semantic Parser uses the syntactic
and semantic definition of the input language to
build parses at the phrasal level. It is triggered
by new word and phrasal hypotheses. This KS is
not restricted to left-to-right parsing, but rather
works piecemeal wherever hypotheses occur.
One of its responsibilities is to identify possible
interpretations for the entire utterance. (See
Hayes-Roth and Mostow [75].)

The Syntactic-Semantic Hypothesizer also uses
the syntactic and semantic definition of the input
language. It creates phrasal and word
hypotheses which are likely to occur adjacent
to phrasal and word hypotheses already on the
blackboard. This provides "top-down" activity in
the system.

The Rating Policy KS operates at all levels of the
blackboard. Its function is to propagate
evaluations of hypotheses. For each hypothesis, this KS
calculates ratings which are based on a) intrinsic
ratings placed on the hypothesis by other KSs and
b) the hypothesis's relationships to other
hypotheses.

Hypotheses: Structure and Interrelationships

As described above, the structure of hypotheses
at each level in the blackboard is identical (i.e., the
interpretation of hypotheses at different levels is
imposed by the KSs dealing with them). The internal
structure of an hypothesis consists of a fixed set of
attributes (i.e., fields which are named); this set is the
same for hypotheses at all levels of representation in
the blackboard. The values of the attributes are set
and modified by the KSs.

Besides holding information necessary to
describe the hypothesis, attributes also serve as
mechanisms for implementing the data-directed
hypothesize-and-test paradigm. That is, a KS can
specify particular attributes of hypotheses (usually at
particular levels) which it wants to have monitored;
whenever a change is made to one of these
monitored attributes, the KS (through its
precondition) can be activated and notified of the nature of
the change.
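
Attribute monitoring of this kind resembles an observer arrangement. The sketch below is an assumption about how it could look, not the kernel's interface: the class name, the callback signature, and the dictionary form of hypotheses are all invented for illustration.

```python
class MonitoredBlackboard:
    def __init__(self):
        # (level, attribute) -> callbacks registered by interested KSs
        self.monitors = {}

    def monitor(self, level, attribute, callback):
        self.monitors.setdefault((level, attribute), []).append(callback)

    def set(self, hyp, attribute, value):
        old = hyp.get(attribute)
        hyp[attribute] = value
        if old != value:
            # Notify each monitoring KS of the nature of the change.
            for callback in self.monitors.get((hyp["level"], attribute), []):
                callback(hyp, attribute, old, value)


events = []
bb = MonitoredBlackboard()
bb.monitor("lexical", "rating",
           lambda hyp, attr, old, new: events.append((hyp["name"], attr, old, new)))

word = {"name": "H1", "level": "lexical"}
phone = {"name": "H2", "level": "phonetic"}
bb.set(word, "rating", 0.8)    # monitored: the callback fires
bb.set(phone, "rating", 0.5)   # not monitored at this level: no callback
```

In a fuller version the callback would evaluate the KS precondition rather than simply record the event.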

Attributes  can  be  grouped  into  several  classes: 

- The first class of attributes names the
  hypothesis. It contains the unique name of the
  hypothesis, the name of its level, and its label from the
  element set at that level.

- One very important set of attributes specifies
  structural relationships with other hypotheses,
  as described below.

- The next class of attributes is composed of
  parameters which rate the hypothesis. These
  include separate numerical ratings derived
  from a) a priori information about the
  hypothesis (usually placed on the hypothesis by its
  creator KS), and b) information derived from its
  relationships to other hypotheses.

- Another set of attributes contains information
  about KS attention to the hypothesis. These
  include suggestions (by KSs) of what type of
  further processing should occur. These
  suggestions are goals.

- For speech, time is a fundamental concept, so
  the Hearsay II system has a class of attributes
  for describing the begin- and end-time and the
  duration of the event which the hypothesis
  represents. These attributes include ways of
  explicitly representing fuzzy notions of the
  times. Besides its descriptive importance, the
  time attribute class is used to partition the
  blackboard for efficient access; e.g., a KS can
  retrieve hypotheses which overlap a particular
  time region. Using both time and level, a
  two-dimensional partitioning occurs.

- The capability for arbitrary KS-specific
  attributes is also included. This can be used by a KS
  to hold arbitrary information about the
  hypothesis; in this way a KS need not hold state
  information about the hypothesis internally across
  activations of the KS, which allows, for example,
  the implementation of generator functions. If
  several KSs share knowledge of the name of
  one of these attributes, each of them can access
  and modify the attribute's value and thus
  communicate just as if it were a "standard" attribute;
  this can be used as an escape mechanism for
  explicit KS intercommunication.

- A unique class of hypothesis attributes, called
  processing state attributes, contains succinct
  summaries and classifications of the values of
  the other attributes. For example, the values of
  the rating attributes are summarized and the
  hypothesis is classified as either "unrated",
  "neutral" (noncommittal), "verified",
  "guaranteed" (strongly verified and unique), or
  "rejected". Other processing state attributes
  summarize the structural relationships with other
  hypotheses and characterize, for example,
  whether the hypothesis has been "sufficiently
  and consistently" described as an abstraction
  of hypotheses at lower levels. The processing
  state attributes are especially useful for
  efficiently triggering KSs; for example, a KS may
  specify in its precondition that it is to be
  activated whenever a hypothesis at a particular
  level becomes "verified". These attributes are
  also used for the goal-directed scheduling of
  KSs, as described in the next section.

Given a specific hypothesis, a KS can examine the
value of any of its attributes. A KS also needs
the ability to retrieve sets of hypotheses whose
attributes satisfy conditions in which the KS is
interested; e.g., a KS may want to find all hypotheses
at the phonetic level which are vowels and which
occur within a particular time range. The system
provides an associative retrieval search mechanism
for accomplishing this. The search condition is
specified by a matching prototype which is a partial
specification of the components of a hypothesis.
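
A minimal sketch of retrieval by matching prototype follows, using the vowel example from the paragraph above. The representation is an assumption: hypotheses are dictionaries, and a prototype maps an attribute name either to a required value or to a predicate on the value. The attribute names ("vowel", "begin", "end") and the times in msec are illustrative only.

```python
def matches(hyp, prototype):
    """True if the hypothesis satisfies every component of the prototype."""
    for attribute, wanted in prototype.items():
        if attribute not in hyp:
            return False
        value = hyp[attribute]
        if callable(wanted):
            if not wanted(value):      # predicate on the attribute value
                return False
        elif value != wanted:          # exact value required
            return False
    return True


def retrieve(hypotheses, prototype):
    return [h for h in hypotheses if matches(h, prototype)]


hypotheses = [
    {"name": "H1", "level": "phonetic", "vowel": True, "begin": 120, "end": 200},
    {"name": "H2", "level": "phonetic", "vowel": False, "begin": 200, "end": 240},
    {"name": "H3", "level": "phonetic", "vowel": True, "begin": 400, "end": 480},
    {"name": "H4", "level": "lexical", "begin": 120, "end": 240},
]

# All vowels at the phonetic level lying within the range 100..300 msec.
prototype = {
    "level": "phonetic",
    "vowel": True,
    "begin": lambda t: t >= 100,
    "end": lambda t: t <= 300,
}
found = [h["name"] for h in retrieve(hypotheses, prototype)]
```

Attributes left out of the prototype are unconstrained, which is what makes it a partial specification. This sketch scans every hypothesis; the partitioning by time and level mentioned earlier would avoid that.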

Structural relationships between hypotheses in
the blackboard are represented through the use of
links; links provide a means of specifying contextual
abstractions about the relationships of hypotheses.
A link is an element of the blackboard which
associates two hypotheses as an ordered pair; one
of the nodes is termed the upper hypothesis, and the
other is called the lower hypothesis. The lower
hypothesis is said to support the upper hypothesis,
while the upper hypothesis is called a use of the
lower one; in general, the lower hypothesis is at the
same or a lower level in the blackboard than the
upper hypothesis.

There are several types of links, with the types
describing various kinds of relationships. Consider
this structure:

[Figure: upper hypothesis H1 joined by links L1, L2, and L3
to lower hypotheses H2, H3, and H4, respectively.]

H1 is the upper hypothesis and H2, H3, and H4 are
the lower hypotheses of links L1, L2, and L3,
respectively. If the links are all of type OR, the interpretation
is that H1 is either an H2 or an H3 or an H4. This is
one way that alternative descriptions are possible. If
the links in the figure are of type AND, the
interpretation is that all of the lower hypotheses are necessary
to support the existence of H1. Variants of the AND-
and OR-links are also used. An important one is the
SEQUENCE link; it is similar to the AND-link except
that a contiguous time-ordering is implied on the set
of lower hypotheses supporting the upper
hypothesis: if the links in the figure are SEQUENCE
links, then H4 follows H3, which follows H2.
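
One possible reading of the three link types is sketched below. The class names, the is_supported function, and the treatment of "contiguous" as "each lower hypothesis begins where the previous one ends" are assumptions; the text does not give the system's actual rule for combining lower hypotheses, and all links to one upper hypothesis are assumed to share a type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hyp:
    name: str
    begin: int   # begin time (arbitrary units)
    end: int     # end time


@dataclass(frozen=True)
class Link:
    upper: Hyp
    lower: Hyp
    kind: str    # "OR", "AND", or "SEQUENCE"


def is_supported(links, accepted):
    """links: the links below one upper hypothesis, in order.
    accepted: names of lower hypotheses currently believed."""
    kind = links[0].kind
    lowers = [link.lower for link in links]
    present = [h.name in accepted for h in lowers]
    if kind == "OR":
        return any(present)            # any one alternative suffices
    if not all(present):
        return False                   # AND and SEQUENCE need every support
    if kind == "SEQUENCE":
        # Assumed meaning of contiguity: no gap or overlap between neighbors.
        return all(a.end == b.begin for a, b in zip(lowers, lowers[1:]))
    return True


h1 = Hyp("H1", 0, 30)
h2, h3, h4 = Hyp("H2", 0, 10), Hyp("H3", 10, 20), Hyp("H4", 20, 30)


def links_of(kind, lowers=(h2, h3, h4)):
    return [Link(h1, lower, kind) for lower in lowers]
```

With these definitions, an OR structure is supported by H3 alone, an AND structure needs all three, and a SEQUENCE structure additionally fails if H4 is replaced by a hypothesis that starts later than H3 ends.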

Besides showing structural relationships between
hypotheses (e.g., that one hypothesis is composed
of several other units), a link is a statement about the
degree to which one hypothesis implies (i.e., "gives
evidence for the existence of") another hypothesis.
The strength of the implication is held as attributes of
the link. The sense of the implication may be
negative; that is, a link may indicate that one
hypothesis is evidence for the invalidity of another.
This statement of implication may be bidirectional;
the existence of the upper hypothesis may give
credence to the existence of the lower hypothesis
and vice versa. Finally, these relationships can be
constructed in an iterative manner; links can be
added between existing hypotheses by KSs as they
discover new evidence for support.

Just  as  an  hypothesis  can  have  more  than  one 
lower  link,  so  it  can  have  several  upper  links.  Each  of 
these  represents  a different  use  of  the  hypothesis; 
the  uses  may  be  competing  or  complementary.  The 
ability  to  have  multiple  uses  and  supports  of  the 
same  hypothesis,  as  opposed  to  creating  duplicates 
for  each  competing  use  and  abstraction,  serves  to 
keep  the  blackboard  compact  and  thereby  reduces 
the  combinatoric  explosion  in  the  search  space. 
Further,  since  all  the  information  about  the 
hypothesis  is  localized,  all  uses  and  supports  of  the 
hypothesis  automatically  and  immediately  share  any 
new  information  added  to  the  hypothesis  by  any 
KSs. As changes are made to a hypothesis, some of
its uses and supports may conflict with each other; if
these conflicts become too large, a KS can decide to
resolve them by either eliminating some of the
conflicting attributes or by splitting the hypothesis into
two or more hypotheses, each of which is more
internally consistent.

Goal-Directed  Scheduling  of  Knowledge  Sources 

As described earlier, the overall goal of the system
is to create the most plausible network of
hypotheses that sufficiently spans the levels. At any
instant of time the blackboard may contain many
incomplete networks, each of which is plausible as far
as it goes. Some of these incomplete networks may
also share subnetworks. Through KS activity,
incomplete networks can be expanded (or contracted)
and may be joined together (or fragmented). At any
time, there may be many places in the blackboard
which satisfy the (precondition) contexts for the
activation of particular KSs. The task of goal-directed
scheduling is that of deciding which of these sites
should be allocated computing resources.

Several of the attribute classes of a hypothesis
can be helpful in making scheduling decisions.
Particularly valuable are the values of the attention
attributes, which, as described earlier, are indicators
telling how much computation has been expended
on the hypothesis and suggestions by KSs of how
desirable it is to devote further effort to the
hypothesis (along with the kinds of processing that
are desirable). The processing-state attributes and
the ratings are also valuable for making scheduling
decisions.

The implementation of the goal-directed
scheduling strategy is separated from the actions of
individual KSs. That is, the decision of whether a KS
can contribute in a particular context is local to the
KS, while the assignment of that KS to one of the
many contexts on which it can possibly operate is
made more globally. The three aspects of a)
decoupling of focusing strategy from KS activity, b)
decoupling of the data environment (blackboard)
from the control flow (KS activation), and c) the
limited context in which a KS operates, together
permit a quick refocusing of attention of KSs. The ability
to refocus quickly is very important because the
errorful nature of the KS activity leads to many
incomplete and possibly contradictory hypothesis
networks; thus, as soon as possible after a network
no longer seems promising, the resources of the
system should be employed elsewhere.
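
The separation of scheduling policy from KS activity can be sketched as follows. This is an illustration under assumptions, not the Hearsay II scheduler: the Scheduler class, the policy signature, and the particular scoring rule (rating minus a penalty for effort already spent) are all invented for the example.

```python
class Scheduler:
    def __init__(self, policy):
        self.policy = policy   # (ks_name, hypothesis) -> score; higher runs first
        self.pending = []      # (ks_name, hypothesis) pairs awaiting resources

    def post(self, ks_name, hypothesis):
        self.pending.append((ks_name, hypothesis))

    def next(self):
        # Scores are recomputed on every call, so a changed rating shifts
        # attention at once instead of waiting for a fixed cycle to finish.
        if not self.pending:
            return None
        best = max(self.pending, key=lambda pair: self.policy(*pair))
        self.pending.remove(best)
        return best


def policy(ks_name, hyp):
    # Assumed policy: prefer well-rated hypotheses with little effort spent.
    return hyp["rating"] - 0.1 * hyp["effort"]


a = {"name": "A", "rating": 0.9, "effort": 0}
b = {"name": "B", "rating": 0.6, "effort": 0}

scheduler = Scheduler(policy)
scheduler.post("parser", a)
scheduler.post("parser", b)

a["rating"] = 0.2           # network A no longer seems promising
first = scheduler.next()    # attention moves to B
```

Because the policy is an ordinary function passed in from outside, it can be replaced or tuned without touching any KS, which is the kind of experimentation the Hearsay I overlord made awkward.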

Implementation  and  Current  Status 

Hearsay II is implemented (as was Hearsay I) on
the PDP-10 in SAIL (VanLehn [73]), an extended
Algol-60. A number of language mechanisms,
particularly the flexible macro facility, are used to
extend the language to include the kernel of the
Hearsay II system; the result is a problem-oriented
programming system for writing KSs and exploring
various configurations. The major facilities provided
include:

- KS definition facilities,

- blackboard accessing routines, both direct
  and associative retrieval,

- blackboard modification routines,

- a scheduler which activates KSs,

- an overlay facility which extends the 256K-word
  address space so that large configurations can
  be used,

- blackboard monitoring and tracing facilities,

- general-purpose tools for experimenter
  interaction with KSs, including breakpoints,
  execution tracing, examination and modification of
  variables, and execution of functions of the KS,

- tools for building high-level debugging and
  interactive features that are KS-specific,

- a package for graphical output of blackboard
  structures,

- a timing package for determining execution
  costs, and

- a means of reading "cliche" files: stored
  sequences of commands used for configuring
  and controlling the system.

The system that results is highly structured and has
many conventions to which the participating
researchers must adhere. This is necessary in order
to maintain a system that many people are modifying
and using concurrently. (There are currently about
five people maintaining and modifying the kernel
and approximately a dozen others experimenting
with various KS configurations; a usable and
up-to-date system must be operational at all times.)

The kernel has been operational since spring,
1974, and has gone through several major
implementation iterations. All the KSs described above
are operational; several of them represent second or
third generation versions. Because the overlay
facility has only just come up (summer, 1975),
performance of the system as a whole is still unknown; the
KSs have been developed using small
configurations at a time. It is expected that preliminary
overall performance information will be available by
the end of 1975, but development will continue over
the foreseeable future, as long as progress
continues to be made.

Although Hearsay II is running on a uni-processor,
it is implemented using multiple processes. The
asynchronous nature of KS activation raises a
number of issues related to interaction on the
blackboard. In particular, because the execution of a
KS may be delayed for an arbitrary period following
the blackboard modification which triggered the KS,
it is possible that intervening actions (of other KSs)
may have invalidated its triggering conditions by the
time that it actually executes. Mechanisms have
been developed to handle these problems. This
aspect of the research is described in Lesser,
Fennell, Erman, and Reddy [74], Fennell [75], and
Fennell and Lesser [75].
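
The text does not say what these mechanisms are, so the following is only one plausible approach, offered as an assumption: re-evaluate the precondition immediately before a delayed KS runs, and skip the KS if its triggering condition no longer holds. All names and the toy blackboard layout are hypothetical.

```python
def run_if_still_valid(blackboard, precondition, action, context):
    """Re-check the precondition just before a delayed KS executes."""
    if not precondition(blackboard, context):
        return False   # the triggering condition was invalidated meanwhile
    action(blackboard, context)
    return True


# Toy blackboard: hypothesis name -> processing state.
blackboard = {"W1": "verified"}


def word_is_verified(bb, name):
    return bb.get(name) == "verified"


def extend_parse(bb, name):
    bb["P1"] = "unrated"   # pretend a phrasal hypothesis is built on the word


# The KS was triggered when W1 became "verified", but another KS
# rejects W1 before the first one is actually scheduled.
blackboard["W1"] = "rejected"
ran = run_if_still_valid(blackboard, word_is_verified, extend_parse, "W1")
```

Here the delayed KS does nothing, because the hypothesis it was triggered on has since been rejected. The cited papers should be consulted for the mechanisms actually used.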

The Hearsay II system also contains facilities for
simulating its execution on a multi-processor
machine. Here the issues of process interference,
resource locking (and process deadlock), and
processor utilization are met. The papers referenced
in the preceding paragraph also describe these
aspects in detail. The simulations, using just a subset
of the current KSs (see footnote 6), indicate that
Hearsay II can effectively utilize as many as twelve
processors, with even
more likely as the other KSs are added and as the
scheduler is improved to reduce conflicts.

A preliminary implementation of the Hearsay II
kernel has been carried out on C.mmp (CMU's
multi-mini-processor). This has validated the
multiprocessing design of the system. This
implementation has been accomplished using the L* system
(Newell and Robertson [75]). Much of the further
investigation of Hearsay II will take place in this
context.


Acknowledgments

A significant portion of the CMU Computer
Science Department has been involved in the
Hearsay efforts; the author is only one of many.

- Raj Reddy is responsible for a large measure of
  the ideas, energy, and vision of this work;
  without him, the project would not have existed.

- Richard Fennell, Rick Hayes-Roth, Victor
  Lesser, Richard Neely, and Linda Shockey have
  been instrumental in the ideas and their
  exploration.

- Allen Newell has provided guidance and
  encouragement.

- Greg Gill deserves special mention for his
  outstanding contribution of programming the
  Hearsay II kernel (several times).

- Without taking all the space needed to describe
  each of their contributions, we would like to
  acknowledge the efforts of all members of the
  CMU "speech group", as well as the entire
  computer science department, for contributing to
  a first-rate research environment.

I wish to thank Vic Lesser for considerable help with
this paper.


6 Only a subset of five KSs was used for these
simulations because a) the overhead of simulation
is very high and b) when the simulations began
many of the current KSs either did not exist or
were too undeveloped to use.




References 

Baker [75] Baker, J. K. Stochastic modeling as a
means of automatic speech recognition. Doctoral
Dissertation, Computer Science Dept.,
Carnegie-Mellon University, Pittsburgh, PA, 1975.

Erman, Fennell, Lesser, and Reddy [73] Erman, L. D.,
R. D. Fennell, V. R. Lesser, and D. R. Reddy.
System organizations for speech understanding:
Implications of network and multiprocessor
computer architectures for AI. Proc. 3rd Inter. Joint
Conf. on Artificial Intel., Stanford, CA, 1973,
194-199.

Erman [74] Erman, L. D. An environment and system
for machine understanding of connected speech.
Doctoral Dissertation, Computer Science Dept.,
Stanford University; Technical Report, Computer
Science Dept., Carnegie-Mellon University,
Pittsburgh, PA, 1974.

Erman [74b] Erman, L. D. (Ed.) Contributed Papers
of the IEEE Symposium on Speech Recognition.
April 15-19, 1974, Pittsburgh, PA, IEEE Cat. No.
74CH0878-9AE. Many of these papers have been
reprinted in a special issue of IEEE Trans. on
Acoustics, Speech, and Signal Processing,
ASSP-23, 1 (1975).

Fennell [75] Fennell, R. D. Multiprocess software
architecture for AI problem solving. Doctoral
Dissertation, Computer Science Dept.,
Carnegie-Mellon University, Pittsburgh, PA, 1975.

Fennell and Lesser [75] Fennell, R. D. and V. R.
Lesser. Parallelism in AI problem solving: A case
study of Hearsay II. Sagamore (NY) Computer
Conf. on Parallel Processing, 1975.

Goldberg  et  al.  1 74 ] Goldberg,  H.  G.,  D.  R.  Reddy, 
and  R Suslick.  Parameter  independent  machine 
segmentation  and  labeling.  In  Erman  |74b],  106- 
111. 

Hayes-Roth  and  Mostow  (75]  Hayes-Roth,  F.  and 
D.  J.  Mostow.  An  automatically  compilable  rec- 
ognition network  for  structured  patterns.  Proc.  4th 
Inter  Joint  Coni  on  Artificial  Intel.,  Tbilisi,  USSR, 
1975. 

Lesser, Fennell, Erman, and Reddy [74]  Lesser,
V. R., R. D. Fennell, L. D. Erman, and D. R. Reddy.
Organization of the Hearsay II speech understanding
system. In Erman [74b], 11-21. Also
appeared in IEEE Trans. on Acoustics, Speech,
and Signal Processing, ASSP-23, 1 (1975), 11-23.

Lesser [75]  Lesser, V. R. Parallel processing in
speech understanding systems: A survey of
design problems. In Reddy [75], 481-499.

Lowerre [75]  Lowerre, B. T. Doctoral Dissertation
(in preparation). Computer Science Dept.,
Carnegie-Mellon University, Pittsburgh, PA, 1975.


Neely [73]  Neely, R. B. On the use of syntax and
semantics in a speech understanding system.
Doctoral Dissertation, Stanford University;
Technical Report, Computer Science Dept.,
Carnegie-Mellon University, Pittsburgh, PA, 1973.

Newell et al. [71]  Newell, A., J. Barnett, J. Forgie,
C. Green, D. Klatt, J. C. R. Licklider, J. Munson,
R. Reddy, and W. Woods. Speech Understanding
Systems: Final Report of a Study Group.
Computer Science Dept., Carnegie-Mellon University,
Pittsburgh, PA, 1971. Also Elsevier/North-Holland,
Amsterdam, 1973.

Newell [73]  Newell, A. Production systems: Models
of control structures. In W. C. Chase (Ed.),
Visual Information Processing, Academic Press,
NY, 463-526.

Newell and Robertson [75]  Newell, A. and G. Robertson.
Some issues in programming multi-mini-processors.
Behav. Res. Methods and Instr., 7,
2, 75-86.

Reddy [66]  Reddy, D. R. An approach to computer
speech recognition by direct analysis of the
speech wave. Doctoral Dissertation, AI Memo No.
43, Computer Science Dept., Stanford University,
Stanford, CA, 1966.

Reddy and Vicens [68]  Reddy, D. R. and Vicens,
P. J. A procedure for segmentation of connected
speech. J. Audio Engr. Soc., 16, 4 (1968).

Reddy, Erman, and Neely [70]  Reddy, D. R., L. D.
Erman, and R. B. Neely. The CMU speech
recognition project. Proc. IEEE System Sciences and
Cybernetics Conf., Pittsburgh, PA, 1970.

Reddy [71]  Reddy, D. R. Speech recognition:
Prospects for the seventies. Proc. IFIP 1971, Ljubljana,
Yugoslavia, Invited paper section 1-5 to 1-13.

Reddy, Erman, and Neely [72]  Reddy, D. R., L. D.
Erman, and R. B. Neely. A mechanistic model of
speech perception. Computer Science Research
Review 1971-72, Computer Science Dept.,
Carnegie-Mellon University, Pittsburgh, PA, 1972,
7-15.

Reddy, Erman, and Neely [73]  Reddy, D. R., L. D.
Erman, and R. B. Neely. A model and a system for
machine recognition of speech. IEEE Trans. Audio
and Electroacoustics, AU-21, 3, 1973, 229-238.

Reddy, Erman, Fennell, and Neely [73]  Reddy, D. R.,
L. D. Erman, R. D. Fennell, and R. B. Neely. The
Hearsay speech understanding system: An
example of the recognition process. Proc. 3rd
Inter. Joint Conf. on Artificial Intel., Stanford, CA,
1973, 185-193.


Reddy [75]  Reddy, D. R. (Ed.), Speech Recognition:
Invited Papers of the IEEE Symposium, April 15-
19, 1974, Pittsburgh, PA, Academic Press, NY,
1975.

Shockey and Erman [74]  Shockey, L. and L. D.
Erman. Sub-lexical levels in the Hearsay II speech
understanding system. In Erman [74b], 208-210.

Simon [66]  Simon, H. A. Scientific discovery and the
psychology of problem solving. Mind and Cosmos:
Essays in Contemporary Science and Philosophy,
Series in Philosophy of Science, University of
Pittsburgh, Pittsburgh, PA, (1966), 3, 22-40.

Simon [71]  Simon, H. A. The theory of problem
solving. In Information Processing 71, North-
Holland, 1971, 261-277.

VanLehn [73]  VanLehn, K. A. SAIL User Manual.
Memo AIM-204, Stanford Artificial Intelligence
Laboratory, Stanford University, Stanford, CA,
1973.

Vicens [69]  Vicens, P. Aspects of speech
recognition. Doctoral Dissertation, Report CS-127,
Computer Science Dept., Stanford University,
Stanford, CA, 1969.


A MULTI-LEVEL ORGANIZATION FOR PROBLEM SOLVING

ABSTRACT

An organization is presented for implementing solutions to
knowledge-based AI problems. The hypothesize-and-test
paradigm is used as the basis for cooperation among many
diverse and independent knowledge sources (KS's). The KS's
are assumed individually to be errorful and incomplete.

A uniform and integrated multi-level structure, the
blackboard, holds the current state of the system. Knowledge
sources cooperate by creating, accessing, and modifying
elements in the blackboard. The activation of a KS is
data-driven, based on the occurrence of patterns in the blackboard
which match templates specified by the knowledge source.

Each  level  in  the  blackboard  specifies  a different 
representation  of  the  problem  space;  the  sequence  of  levels 
forms  a loose  hierarchy  in  which  the  elements  at  each  level  can 
approximately  be  described  as  abstractions  of  elements  at  the 
next  lower  level.  This  decomposition  can  be  thought  of  as  an 
a priori  framework  of  a plan  for  solving  the  problem;  each  level 
is  a generic  stage  in  the  plan. 

The elements at each level in the blackboard are
hypotheses about some aspect of that level. The internal
structure of an hypothesis consists of a fixed set of attributes;
this set is the same for hypotheses at all levels of
representation in the blackboard. These attributes are selected
to serve as mechanisms for implementing the data-directed
hypothesize-and-test paradigm and for efficient goal-directed
scheduling of KS's. Knowledge sources may create networks of
structural relationships among hypotheses. These relationships,
which are explicit in the blackboard, serve to represent
inferences and deductions made by the KS's about the
hypotheses; they also allow competing and overlapping partial
solutions to be handled in an integrated manner.

The Hearsay II speech-understanding system is an
implementation of this organization; it is used here as an
example for descriptive purposes.

INTRODUCTION 

This paper describes an organization for knowledge-based
artificial intelligence (AI) programs. Although this
organization has been derived while developing several
generations of speech understanding systems, we feel that it
has general application to other domains of large AI problems
(e.g., vision,2 robotics, chess, natural language understanding,
and protocol analysis).

Our efforts follow from the early work of Reddy (1966)
and Reddy and Vicens (Vicens, 1969), through the Hearsay I
system (Reddy, et al., 1973a, 1973b; Erman, 1974), which was
the first demonstrable connected-speech understanding system,
up through the currently developing Hearsay II system (Erman,
et al., 1973; Lesser, et al., 1974; Fennell, 1975).3 These efforts


1 This research was supported in part by the Defense
Advanced Research Projects Agency under contract no.
F44620-73-C-0074 and monitored by the Air Force Office of
Scientific Research.

2 Reddy (1973) is a comparison of the speech and vision
problem domains.

3 Lesser, et al. (1975) contains a detailed description of
Hearsay II, including the design decisions which mark its
derivation from Hearsay I.


have increasingly focused on the overall system organization for
solving the problem; this has resulted in the design and
construction of a sophisticated and structured environment
within which problem-solving strategies are developed. Others
working in this area also consider this aspect important.1 The
Hearsay II system will be used here as the primary example for
describing the organization.


THE  PROBLEM 

The class of AI problem that is addressed in this paper is
characterized by having a large problem space and the
requirement of a large amount of knowledge for its solution.
The large amount of explicit knowledge differentiates these
problems from other AI areas (e.g., theorem-proving) in which
very general "weak" methods are applied using meager amounts
of built-in knowledge (Newell, 1969). Further, the knowledge
needed covers a wide and diverse set of areas (some examples
in the speech understanding problem are signal analysis,
acoustic-phonetics, phonology, syntax, semantics, and
pragmatics). We call each such area a knowledge source (KS)
and also define a KS to be an agent which embodies the
knowledge of its area and which can take actions based on that
knowledge.

The sources of knowledge are often incomplete and
approximate. This errorful nature may be traced to three
sources: First, the theory on which the KS is based may be
incomplete or incorrect. For example, modern phonological
theories, as applied to the speech problem, are often vague and
incomplete. Second, the implementation of a KS may be
incomplete or incorrect; this may be caused by an incorrect
translation of the theory to the program or by an intentionally
heuristic implementation of the theory. Finally, the knowledge
source may be operating on incorrect or incomplete data
supplied to it by other KS's.3

As one knowledge source makes errors and creates
ambiguities, other KS's must be brought to bear to correct and
clarify those actions. This KS cooperation should occur as soon
as possible after the introduction of an error or ambiguity in
order to limit its ramifications.

A mechanism for providing this high degree of
cooperation is the hypothesize-and-test paradigm. In this
paradigm, solution-finding is viewed as an iterative process.
Each step in the iteration involves a) the creation of an
hypothesis, which is an "educated guess" about some aspect of

1 Newell, et al. (1971) contains an excellent in-depth study of
the speech understanding problem. The current state-of-the-
art is represented in the papers of the 1974 IEEE Symposium
on Speech Recognition (Erman, 1974b; Reddy, 1975). In
particular, Barnett (1973, 1975), and Rovner, et al. (1974)
also describe highly structured systems; Baker (1974) has a
highly structured system based on a simple Markov model.

2 For the purposes of this discussion, a KS can be considered
static; i.e., whether a KS learns from experience is an issue
that is orthogonal to this organization.

3 This may also include externally supplied data (e.g., the
digitized acoustic wave-form which is the input to the
speech-understanding system); the transducers of these data
can be considered to be KS's which also introduce error.

[...] The structure of a hypothesis [...] KS's to be sensitive
to [...] precondition [...] (such as the creation of [...] in
some [...]

[...] knowledge source decomposition is not an attempt to
characterize the overall problem solution process and then
apply some sort of traffic flow analysis to its internal
workings to decompose the total process into minimally
interacting knowledge sources. Rather, knowledge sources are
designed with some intuitive notion about the pieces of
knowledge which could be incorporated in a useful way to help
achieve a solution.

[...] hypothesize-and-test paradigm, KS's can also exhibit a
high degree of asynchronous activity and potential
parallelism.1

Control schemes in which one KS explicitly invokes other
KS's are not appropriate because of the requirement that KS's
be independent and because the invocation of a KS may depend
on a complex set of conditions which is created by the combined
actions of several KS's. Further, such direct-calling schemes
complicate KS's by requiring that they contain information about
the KS's that they will call. These same arguments apply
against a centralized control scheme which is explicitly
predefined ("wired in") for a particular set of KS's.


Decomposition of the Blackboard

The blackboard is partitioned into distinct information
levels; each level is used to hold a different representation of
the problem space. (Examples of levels in the speech problem
are "syntactic", "lexical", "phonetic", and "acoustic"; examples in
scene analysis are "picture point", "line segment", "region", and
"object".) Associated with each level is a set of primitive
elements appropriate for representing the problem at that level.
(In the speech system, for example, the elements at the lexical
level are the words of the vocabulary to be recognized, while
the elements at the phonetic level are the phones (sounds) of
English.) Each hypothesis exists at a particular level and is
labeled as being a particular element of the set of primitive
elements at that level.

The decomposition of the problem space into levels is a
natural parallel to the decomposition into KS's of the knowledge
that is to be brought to bear. For many KS's, the KS needs to
deal with only one or a few levels to apply its knowledge; it
need not even be aware of the existence of other levels. Thus,
each KS can be made as simple as its knowledge allows; its
interface to the rest of the system is in units and concepts
which are natural to it. Also, new levels can be added as new
sources of knowledge are designed which need to use them.
Finally, it will be shown that the multi-level representation
allows for efficiently sequencing the activity of the KS's in a
non-deterministic manner and for making use of multiprocessing.

The sequence of levels forms a loose hierarchical
structure in which the elements at each level can approximately
be described as abstractions of elements at the next lower
level. (For example, an utterance is composed of phrases,
which are made of words, put together as syllables, each of
which can be described as a sequence of phones, each of which
is composed of acoustic segments, each of which can be
described by a sequence of ten-millisecond intervals with
certain kinds of acoustic characteristics.)

Most of the relationships of a hypothesis are with
hypotheses at its level or adjacent levels; further, these
relationships can usually be derived (by a KS appropriate to the
level) without having to delve below the level of abstraction of
the hypothesis. This locality of context simplifies the function
of knowledge sources. (Or from the other point of view, the
decomposition of knowledge into sufficiently simple-acting KS's
also simplifies and localizes relationships in the blackboard.)1

The decomposition of the blackboard into distinct levels
of representation can also be thought of as an a priori
framework of a plan for problem-solving. Each level is a
generic stage in the plan. The goal at each level is to create
and validate hypotheses at that level. The overall goal of the
system is to create the most plausible network of hypotheses
that sufficiently covers the levels. ('Plausible' and 'sufficiently'
here mean "plausible and sufficient in the judgment of the
knowledge sources".) In speech understanding, for example, the
goal at the phonetic level is a phonetic transcription of the
utterance, while the overall goal is a network which connects
hypotheses directly derived from the acoustic input to
hypotheses which describe the semantic content of the
utterance.

The creation or modification of an hypothesis which is
based on a context of hypotheses at a lower level (or levels)
can be considered an action of synthesis, or abstraction;
conversely, manipulations of an hypothesis based on a higher
level context can be considered analysis, or elaboration. In
order to reduce the propagation and accumulation of errors
caused by KS's, both kinds of action are needed in the system.2

Because of the choice of decomposition, the context for
an analysis or synthesis action is usually localized to the level
just above or below the level at which the action takes place.
However, this is not a requirement; in fact, an action which skips
over several levels can serve strongly to direct the activity of
the system and thereby significantly prune the search space.
Such a jump over levels is equivalent to constructing a major
step in a plan. Further, there is no requirement that a jump
necessarily be filled in completely (or even partially) if KS's are
confident enough in the consistency of the larger step. Thus,
the KS's can dynamically define the granularity in the
hypothesis network necessary to assure the desired degree of
consistency; this granularity may vary at different places in the
blackboard, depending on the particular structures that occur.

Appendix A contains a description of the blackboard and
KS decompositions for the Hearsay II speech-understanding
system.

Hypotheses: Structure and Interrelationships

The internal structure of an hypothesis consists of a fixed
set of attributes (named fields); this set is the same for
hypotheses at all levels of representation in the blackboard.
These attributes are selected to serve as mechanisms for
implementing the data-directed hypothesize-and-test paradigm.3
The values of the attributes are defined and modified by the
KS's.

Attributes  can  be  grouped  into  several  classes: 


1 One might think of this model for data-directed activation of
KS's as a production system (Newell, 1973) which is executed
asynchronously. The preconditions correspond to the left-
hand sides (conditions) of productions, and the knowledge
sources correspond to the right-hand sides (actions) of the
productions. Conceptually, these left-hand sides are
evaluated continuously. When a precondition is satisfied, an
instantiation of the corresponding right-hand side of its
production is created; this instantiation is executed at some
arbitrary subsequent time (perhaps subject to instantiation
scheduling constraints). It is interesting to note that this
generalized form of hypothesize-and-test leads to a system
organization with some characteristics also similar to QA4
(Rulifson, et al., 1973) and PLANNER (Hewitt, 1972). In
particular, there are strong similarities in the data-directed
sequencing of processes.

2 Many of the ideas here fit neatly into Simon's description of a
"nearly decomposable hierarchical system" (Simon, 1962).


1 This simplification of form and interaction is an expected
characteristic of a nearly decomposable hierarchical system
(ibid.).

2 The use of the terms 'analysis' and 'synthesis' here is
reversed from their usual uses in the speech recognition
domain. Traditionally, 'synthesis' means going from a higher-
level representation (e.g., lexical) to the speech signal, while
'analysis' refers to the other direction. In speech recognition,
however, the objective is the synthesis of a meaning for the
utterance from the pieces of data which make up the speech
signal.

3 In Hearsay II, a KS can specify particular attributes of
hypotheses at particular levels which it wants to have
monitored. Whenever a change is made to one of these
monitored attributes, the KS can be activated and notified of
the nature of the change. The section below on
"Data-Directed Activation of Knowledge Sources" contains a more
complete description of this process.

The first class of attributes names the hypothesis; it
contains the unique name of the hypothesis, the name of
its level, and its label from the element set at that level.

The next class of attributes is composed of parameters
which rate the hypothesis. These include separate
numerical ratings derived from a) a priori information about
the hypothesis, b) analysis actions performed on the
hypothesis, c) synthesis actions, and d) combinations of (a),
(b), and (c).

Another set of attributes contains information about KS
attention to the hypothesis. These include a cumulative
measure of the amount of computation that has already
been expended on the hypothesis as well as suggestions
of how much more processing should occur and of what
type (e.g., a general request for analysis or synthesis or a
specific request for a change to some attributes. These
suggestions are goals.)

One very important set of attributes describes the
structural relationships with other hypotheses, as
described below.

For each problem domain, it is likely that there are other
attributes which are basic to the problem and which should
be provided in the structure of the hypotheses; these form
a problem-specific class of attributes. In speech
understanding, for instance, time is a fundamental concept,
so the Hearsay II system has a class of attributes for
describing the begin- and end-time and the duration of the
event which the hypothesis represents. (These attributes
include ways of explicitly representing fuzzy notions of the
times.) For vision, likely attributes would include the
location and dimension of the element and trajectory
information for moving objects.

The capability for arbitrary KS-specific attributes is also
included. This can be used by a KS to hold arbitrary
information about the hypothesis; in this way a KS need
not hold state information about the hypothesis across
activations of the KS. This allows, for example, the easy
implementation of generator functions. If several KS's
share knowledge of the name of one of these attributes,
each of them can access and modify the attribute's value
and thus communicate just as if it were a "standard"
attribute; this can be used as an escape mechanism for
explicit KS intercommunication.

A unique class of hypothesis attributes, called processing
state attributes, contains succinct summaries and
classifications of the values of the other attributes. For
example, the values of the rating attributes are summarized
and the hypothesis is classified as either "unrated",
"neutral" (noncommittal), "verified", "guaranteed" (strongly
verified and unique), or "rejected". Other processing state
attributes summarize the structural relationships with other
hypotheses and characterize, for example, whether the
hypothesis has been "sufficiently and consistently"
described synthetically (i.e., as an abstraction of
hypotheses at lower levels). The processing state
attributes are especially useful for efficiently triggering
knowledge sources; for example, a KS may specify in its
precondition that it is to be activated whenever a
hypothesis at a particular level becomes "verified". These
attributes are also used for the goal-directed scheduling of
knowledge sources, as described in the next section.
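The attribute classes listed above can be pictured as the fields of a single record type that is shared by every level. The following Python sketch is only an illustration under stated assumptions: the field names, the numeric rating scale, the averaging rule, and the two thresholds are hypothetical and are not taken from Hearsay II. The "guaranteed" state is omitted because it also depends on uniqueness among competing hypotheses, which a single record cannot decide.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Hypothesis:
    # Naming attributes.
    name: str
    level: str
    label: str
    # Rating attributes (None means "not yet rated"; the scale is assumed).
    a_priori: Optional[float] = None
    analysis: Optional[float] = None
    synthesis: Optional[float] = None
    # Attention attributes.
    effort_spent: float = 0.0
    goals: List[str] = field(default_factory=list)
    # Structural attributes: names of linked hypotheses.
    supports: List[str] = field(default_factory=list)  # lower hypotheses
    uses: List[str] = field(default_factory=list)      # upper hypotheses
    # Problem-specific attributes (speech): times, here in milliseconds.
    begin_time: Optional[int] = None
    end_time: Optional[int] = None
    # KS-specific attributes, keyed by a name the KS's agree on.
    ks_data: Dict[str, Any] = field(default_factory=dict)

    def combined_rating(self) -> Optional[float]:
        """Average of whichever separate ratings exist (an assumed rule)."""
        parts = [r for r in (self.a_priori, self.analysis, self.synthesis)
                 if r is not None]
        return sum(parts) / len(parts) if parts else None

    def rating_state(self, verify_at: float = 0.5,
                     reject_at: float = -0.5) -> str:
        """Processing-state summary of the rating attributes."""
        rating = self.combined_rating()
        if rating is None:
            return "unrated"
        if rating >= verify_at:
            return "verified"
        if rating <= reject_at:
            return "rejected"
        return "neutral"


word = Hypothesis(name="h17", level="lexical", label="TELL")
print(word.rating_state())  # state before any KS has rated the hypothesis
word.analysis = 0.8
print(word.rating_state())  # state after one analysis rating is recorded
```

Keeping the summary in one method mirrors the role of the processing state attributes: a precondition can test a single classification instead of re-deriving it from the individual ratings.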

Given a specific hypothesis, a KS can examine the value
of any of its attributes. A knowledge source also needs the
ability to retrieve sets of hypotheses whose attributes satisfy
conditions in which the KS is interested. (E.g., a KS in the
speech system may want to find all hypotheses at the phonetic
level which are vowels and which occur within a particular time
range.) The system provides an associative retrieval search
mechanism for accomplishing this. The search condition is
specified by a matching prototype, which is a partial
specification of the components of a hypothesis. This partial
specification permits a component to be characterized by a) a
set of desired values or b) a don't-care condition. A matching
prototype is applied to a set of hypotheses;1 those hypotheses
whose component values match those specified by the
matching prototype are returned as the result of the search.
(Associative retrieval of structural relationships among
hypotheses is also provided.) More complex retrievals can be
accomplished by combining the retrieval primitives in
appropriate ways.
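A minimal sketch of retrieval by matching prototype follows, written in Python. It assumes that hypotheses are plain dictionaries and that a prototype maps a component name either to a set of desired values or to a don't-care marker; the names `DONT_CARE`, `matches`, `retrieve`, and `overlapping`, and the phone labels in the example, are hypothetical and do not reflect the Hearsay II interface.

```python
from typing import Any, Dict, Iterable, List

DONT_CARE = object()  # sentinel: this component may have any value

Hyp = Dict[str, Any]


def matches(hyp: Hyp, prototype: Dict[str, Any]) -> bool:
    """True if every constrained component has one of the desired values."""
    for attr, wanted in prototype.items():
        if wanted is DONT_CARE:
            continue
        if hyp.get(attr) not in wanted:
            return False
    return True


def retrieve(hyps: Iterable[Hyp], prototype: Dict[str, Any]) -> List[Hyp]:
    """Apply a matching prototype to a set of hypotheses."""
    return [h for h in hyps if matches(h, prototype)]


def overlapping(hyps: Iterable[Hyp], begin: int, end: int) -> List[Hyp]:
    """Primitive set: hypotheses whose time span overlaps [begin, end)."""
    return [h for h in hyps if h["begin"] < end and h["end"] > begin]


blackboard = [
    {"name": "p1", "level": "phonetic", "label": "IY", "begin": 10, "end": 18},
    {"name": "p2", "level": "phonetic", "label": "T", "begin": 18, "end": 22},
    {"name": "p3", "level": "phonetic", "label": "AE", "begin": 40, "end": 52},
    {"name": "w1", "level": "lexical", "label": "TELL", "begin": 10, "end": 30},
]
VOWELS = {"IY", "AE", "AA"}
vowel_prototype = {"level": {"phonetic"}, "label": VOWELS, "name": DONT_CARE}

# Vowels at the phonetic level within the first 30 time units.
found = retrieve(overlapping(blackboard, 0, 30), vowel_prototype)
print([h["name"] for h in found])
```

The example combines two primitives, a time-overlap restriction and a prototype match, in the way the text suggests for more complex retrievals.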

Structural relationships between nodes (hypotheses) in
the blackboard are represented through the use of links; links
provide a means of specifying contextual abstractions about the
relationships of hypotheses. A link is an element which
associates two hypotheses as an ordered pair; one of the nodes
is termed the upper hypothesis, and the other is called the lower
hypothesis. The lower hypothesis is said to support the upper
hypothesis while the upper hypothesis is called a use of the
lower one; in general, the lower hypothesis is at the same or a
lower level in the blackboard than the upper hypothesis.

There are several types of links, with the types
describing various kinds of relationships.2 Consider this
structure:

                 H1
               /  |  \
            L1/ L2|   \L3
             /    |    \
           H2    H3    H4

H1 is the upper hypothesis and H2, H3, and H4 are the lower
hypotheses of links L1, L2, and L3, respectively. If the links are
all of type OR, the interpretation is that H1 is either an H2 or
an H3 or an H4. This is one way that alternative descriptions
are possible. If the links in the figure are of type AND, the
interpretation is that all of the lower hypotheses are necessary
to support the existence of H1. (Note that, in general, all of the
supporting (lower) links of a hypothesis are of the same type;
one can thus talk of the "type of the hypothesis", which is the
same as the type of all of its lower links.)

These two types of node represent different kinds of
abstractions: the OR-node specifies a set/member relationship
while the AND-node defines a composition abstraction. Variants
of the AND- and OR-links are also possible. For example, a
SEQUENCE link is similar to the AND-link except that an
ordering is implied on the set of lower hypotheses supporting
the upper hypothesis. (For the Hearsay II speech understanding
system, this ordering usually is interpreted as indicating a time
ordering of the lower hypotheses.)
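The three link types can be sketched as a small Python structure. This is an illustration under assumptions, not the Hearsay II representation: the `Link` record, the rule that support is decided from a set of currently valid lower hypotheses, and the use of begin times to check a SEQUENCE are all hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class Link:
    upper: str  # name of the upper hypothesis (the "use")
    lower: str  # name of the lower hypothesis (the "support")
    kind: str   # "OR", "AND", or "SEQUENCE"


def lower_links(links: List[Link], upper: str) -> List[Link]:
    """All supporting (lower) links of one hypothesis, in listed order."""
    return [link for link in links if link.upper == upper]


def is_supported(upper: str, links: List[Link], valid: Set[str],
                 begin_times: Optional[Dict[str, int]] = None) -> bool:
    """Decide support from the type shared by the hypothesis's lower links."""
    lows = lower_links(links, upper)
    if not lows:
        return False
    kind = lows[0].kind  # all lower links are assumed to share one type
    present = [link.lower in valid for link in lows]
    if kind == "OR":
        return any(present)      # any one alternative suffices
    if not all(present):
        return False             # AND and SEQUENCE need every part
    if kind == "SEQUENCE":
        if begin_times is None:
            raise ValueError("SEQUENCE links need begin times")
        times = [begin_times[link.lower] for link in lows]
        return times == sorted(times)  # parts must occur in listed order
    return True


or_links = [Link("H1", low, "OR") for low in ("H2", "H3", "H4")]
and_links = [Link("H1", low, "AND") for low in ("H2", "H3", "H4")]
print(is_supported("H1", or_links, {"H3"}))
print(is_supported("H1", and_links, {"H2", "H3"}))
```

Reading the type from the first lower link relies on the observation in the text that all supporting links of a hypothesis are, in general, of the same type.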

Besides showing analysis and synthesis relationships
between hypotheses (e.g., that one hypothesis is composed of
several other units), a link is a statement about the degree to
which one hypothesis implies (i.e., "gives evidence for the
existence of") another hypothesis. The strength of the
implication is held as attributes of the link. The sense of the
implication may be negative; that is, a link may indicate that one
hypothesis is evidence for the invalidity of another. This
statement of implication may be bi-directional; the existence of
the upper hypothesis may give credence to the existence of the
lower hypothesis and vice versa. Finally, these relationships can
be constructed in an iterative manner; links can be added
between existing hypotheses by KS's as they discover new
evidence for support.


1 [...] derived by the KS from several sources. The
Hearsay II implementation includes the following primitive
sets: a) all hypotheses (in the blackboard), b) all
hypotheses at a particular level, c) all hypotheses at a
particular level whose time attributes overlap a given interval
([...] an extremely efficient, two-dimension partition [...]
blackboard), and d) all hypotheses whose attributes
which are being monitored (for the KS) have changed.

2 The particular kinds of relationships described here are some
which have been designed for the speech problem.
Although they undoubtedly are not the complete set for all
conceivable needs, they do represent the kinds of [...]
that need be [...] and are expressible in the [...]

Just as an hypothesis can have more than one lower link,
so it can have several upper links. Each of these represents a
different use of the hypothesis; the uses may be competing or
complementary. The ability to have multiple uses and supports
of the same hypothesis, as opposed to creating duplicates for
each competing use and abstraction, serves to keep the
blackboard compact and thereby reduces the combinatoric
explosion in the search space. Further, since all the information
about the hypothesis is localized, all uses and supports of the
hypothesis automatically and immediately share any new
information added to the hypothesis by any knowledge sources.

A problem  with  this  localization  can  occur  if  the 
interactions  between  hypotheses  span  more  lhan  one  level,  ^ tn 
this  case,  a particular  support  of  the  hypothesis  (at  a lower 
level)  may  be  inconsistent  with  one  (or  more)  of  the  uses  of  the 
hypothesis  (at  a higher  level)  but  is  consistent  with  other  uses 
(or  potential  uses)  of  the  hypothesis,  in  order  to  avoid 
duplicating  the  hypothesis,  a mechanism,  called  a connection 
matrix,  exists  in  the  system,  A connection  matrix  is  an  attribute 
of  a hypothesis;  its  value  specifies  which  of  the  alternative 
supports  of  the  hypothesis  are  applicable  ("connected  to") 
which  of  its  uses.  The  use  of  a connection  matrix  allows  the 
results  of  previous  decisions  of  KS's  to  be  accumulated  for 
future  use  and  modification  without  necessitating  contextual 
duplication  of  parts  of  the  data  base.  This  kind  of  reusage  and 
multiple  usage  of  blackboard  structures,  which  results  in 
localization  of  information,  reduces  much  of  the  expensive 
backtracking  that  characterizes  many  problem-solving  systems. 
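
The connection matrix idea can be illustrated with a minimal
sketch. The class name, field names, and the dictionary encoding
of the matrix are assumptions made for illustration; the paper
does not specify how Hearsay II stores the matrix.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    """One shared blackboard hypothesis (illustrative, not the HSII layout)."""
    name: str
    supports: list = field(default_factory=list)  # lower-level hypotheses
    uses: list = field(default_factory=list)      # upper-level hypotheses
    # Assumed encoding: connection[(support, use)] is False when that
    # support is inconsistent with that use. Missing entries mean "connected".
    connection: dict = field(default_factory=dict)

    def disconnect(self, support, use):
        """Record that `support` does not apply to `use`."""
        self.connection[(support, use)] = False

    def supports_for(self, use):
        """Return the supports that are connected to the given use."""
        return [s for s in self.supports
                if self.connection.get((s, use), True)]


word = Hypothesis("word:night",
                  supports=["syl-A", "syl-B"],
                  uses=["phrase-1", "phrase-2"])
word.disconnect("syl-B", "phrase-1")
```

Here a single hypothesis serves both phrases; only the matrix
records that one support is not applicable to one of the uses, so
no duplicate of the hypothesis is needed.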

Appendix B contains an example of a structure built in the
blackboard of the Hearsay II system.

Goal-Directed Scheduling of Knowledge Sources

As  described  earlier,  the  overall  goal  of  the  system  is  to 
create  the  most  plausible  network  of  hypotheses  that 
sufficiently  spans  the  levels.  At  any  instant  of  time,  the 
blackboard  may  contain  many  incomplete  networks,  each  of 
which  is  plausible  as  far  as  it  goes.  Some  of  these  incomplete 
networks  may  also  share  subnetworks.  Through  the  results  of 
analysis  and  synthesis  actions  of  knowledge  sources,  incomplete 
networks  can  be  expanded  (or  contracted)  and  may  be  joined 
together  (or  fragmented).  At  any  time,  there  may  be  many 
places  in  the  blackboard  which  satisfy  the  (precondition) 
contexts for the activation of particular KS's. The task of
goal-directed scheduling is to decide to which of these sites to
allocate computing resources.

Several of the attribute classes of a hypothesis can be
helpful  in  making  scheduling  decisions.  Particularly  valuable  are 
the  values  of  the  attention  attributes,  which,  as  described 
earlier,  are  indicators  telling  how  much  computation  has  been 
expended  on  the  hypotheses  and  suggestions  by  KS's  of  how 
desirable  it  is  to  devote  further  effort  on  the  hypothesis  (along 
with  the  kinds  of  processing  that  are  desirable).  The  processing 
state  attributes  are  also  valuable  for  making  scheduling 
decisions. 

Using  these  kinds  of  information,  a knowledge  source 
might  be  scheduled  for  execution  because  it  possesses  the  only 
processing  capability  available  to  be  applied  to  an  important 
incompletely explored area of the blackboard. For example, if
the blackboard contains focusing factors2 which highlight
activity in a blackboard region in which there are no structural
connections between two adjoining levels, the scheduler should
give a higher priority to a knowledge source which will attempt
(as indicated in its external specifications) to make such a
connection than to a knowledge source which is likely merely to
perform a minor refinement on the ratings in one of the levels.
However, if there are no such processes ready to execute, the
scheduling algorithm can perform a type of means-ends analysis
in which it schedules those knowledge sources which are likely
to produce blackboard changes which, in turn, might trigger the
activation of KS's in which the system is currently interested.

1 Again, this fits well into Simon's formulation of hierarchical
systems.

Another  parameter  for  determining  KS  activation  priority 
is the validity of the hypotheses which make up the context for
the activation of the KS. This measurement can be used to
implement a best-first strategy.
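
A best-first ordering of pending KS activations might be sketched
as follows. The scoring rule (validity plus a fixed bonus for KS's
that would connect two adjoining levels) and all names are
hypothetical; the actual Hearsay II policy is described in the
works cited in the text, not here.

```python
import heapq


def priority(activation):
    """Hypothetical score: context validity plus a bonus for linking levels."""
    score = activation["validity"]
    if activation["links_levels"]:
        score += 1.0  # assumed weight, chosen only for illustration
    return score


def schedule(activations):
    """Return KS names in best-first order (highest priority first)."""
    # heapq is a min-heap, so priorities are negated; the index breaks ties.
    heap = [(-priority(a), i, a["ks"]) for i, a in enumerate(activations)]
    heapq.heapify(heap)
    order = []
    while heap:
        _, _, ks = heapq.heappop(heap)
        order.append(ks)
    return order


pending = [
    {"ks": "refine-ratings", "validity": 0.9, "links_levels": False},
    {"ks": "join-levels", "validity": 0.5, "links_levels": True},
    {"ks": "weak-context", "validity": 0.2, "links_levels": False},
]
order = schedule(pending)
```

In this sketch the level-joining KS is run first even though its
context is less valid than that of the rating-refinement KS, which
mirrors the preference described above.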

The  implementation  of  the  goal-directed  scheduling 
strategy  is  separated  from  the  actions  of  individual  knowledge 
sources. That is, the decision of whether a KS can contribute in
a particular context is local to the KS, while the assignment of
that KS to one of the many contexts on which it can possibly
operate is made more globally. The three aspects of
a) decoupling of focusing strategy from knowledge-source
activity, b) decoupling of the data environment (blackboard)
from the control flow (KS activation), and c) the limited context
in which a KS operates, together permit a quick refocusing of
attention of KS's. The ability to refocus quickly is very
important because the errorful nature of the KS activity leads
to many incomplete and possibly contradictory hypothesis
networks; thus, as soon as possible after a network no longer
seems promising, the resources of the system should be
employed elsewhere.*

IMPLEMENTATION OF DATA-DIRECTED KNOWLEDGE-SOURCE ACTIVATION

Associated with every knowledge source is a specification
of the blackboard conditions required for the activation of that
knowledge  source.  This  specification,  called  a precondition,  is  a 
decision  procedure  whose  tests  are  matching-prototypes  and 
structural  relationships  which,  when  applied  to  the  blackboard  in 
an  associative  manner,  detect  the  regions  of  the  blackboard  in 
which  the  knowledge  source  is  interested.  This  procedure  may 
contain  arbitrarily  complex  decisions  (based  on  current  and  past 
modifications  to  the  blackboard)  resulting  in  the  activation  of 
desired  knowledge  sources  within  the  chosen  contexts.  The 
context  corresponding  to  the  discovered  blackboard  region 
which  satisfies  some  knowledge  source’s  precondition  is  used  as 
an  initial  context  in  which  to  activate  that  knowledge  source. 
The  efficiency  of  the  KS  precondition  evaluation  is  an  important 
aspect of the system's implementation, especially as the
knowledge  is  decomposed  into  more  and  smaller  KS’s  and  each 
KS  activation  requires  less  computation. 

The Hearsay II system, as an example of an implementation,
makes precondition evaluation efficient by placing additional
functions in the routines which modify the blackboard. These
functions are activated whenever any KS modifies an attribute
in the blackboard which some other KS has asked to be
monitored. The essence of the modification is preserved in a
data structure, called a change set, which is specific to the
attribute changed and the KS which requested the monitoring.
A KS specifies in a nonprocedural way (either statically or
dynamically) those attributes which it wants to monitor. In
order to increase the efficiency, monitoring can further be
localized to particular levels or even individual hypotheses.

2 A focusing factor is a goal (attention attribute) attached to a
hypothesis by a KS which indicates the kind of change
desired in an attribute of the hypothesis and the desirability
of the change. In addition, such goals may be specified for
regions of the blackboard independent of the existence of
hypotheses in the region.

* The ideas of goal-directed scheduling are presented here
only sketchily; Hayes-Roth et al. (1975) provides a complete
description of its use in the Hearsay II system.

Change sets serve to categorize blackboard modifications
(events) and are thus useful in precondition evaluation since
they limit the areas in the blackboard that need be examined in
detail. As currently implemented in Hearsay II, the precondition
evaluator of each knowledge source exists as a separate
process which monitors changes in the data base (i.e., it
monitors additions to those change sets in which the KS is
interested). The precondition process is itself data-directed in
that it is activated only when sufficient changes have been
made in the blackboard (i.e., when an entry is made into one of
its change sets, as a side-effect of a relevant blackboard
modification). In effect, the precondition processes themselves
have preconditions, albeit of a much simpler form than those
possible for knowledge sources. For example, a precondition
process in the speech system may specify that it should be
activated whenever changes occur to two adjacent hypotheses
at the word level or whenever support is added to the phrasal
level. By using the (coarse) classifications afforded by change
sets, the system avoids most unnecessary executions of the
precondition processes. The major point is that the scheme of
precondition evaluation is event-driven, being based on the
occurrence of changes in the blackboard; i.e., it is only at points
of modification to the blackboard that a precondition that was
previously unsatisfied may become satisfied. In particular,
precondition evaluators are not involved in a form of busy
waiting in which they are constantly looking for something that
is not yet there.
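
The event-driven scheme described above can be sketched in a few
lines. This is only an illustration under stated assumptions: the
class names, the single-attribute monitoring key, and the rating
threshold test are invented here, and the precondition runs
synchronously rather than as a separate process.

```python
from collections import defaultdict


class Precondition:
    """Illustrative precondition evaluator with its own change set."""

    def __init__(self, threshold):
        self.threshold = threshold  # assumed test: rating >= threshold
        self.change_set = []        # pending (hypothesis, attribute, value)
        self.contexts = []          # hypotheses where the KS could be activated

    def evaluate(self, blackboard):
        # Examine only the entries recorded in the change set,
        # never the whole blackboard.
        while self.change_set:
            hyp, _attribute, value = self.change_set.pop()
            if value >= self.threshold:
                self.contexts.append(hyp)


class Blackboard:
    def __init__(self):
        self.values = {}
        self.monitors = defaultdict(list)  # attribute -> preconditions

    def monitor(self, attribute, precondition):
        self.monitors[attribute].append(precondition)

    def modify(self, hyp, attribute, value):
        """Modification routine: store the value, then notify monitors."""
        self.values[(hyp, attribute)] = value
        for pre in self.monitors[attribute]:
            pre.change_set.append((hyp, attribute, value))
            pre.evaluate(self)  # activated by the change, no polling


bb = Blackboard()
pre = Precondition(threshold=0.5)
bb.monitor("rating", pre)
bb.modify("word:night", "rating", 0.8)   # monitored, satisfies the test
bb.modify("word:day", "duration", 12)    # not monitored, ignored
bb.modify("word:and", "rating", 0.1)     # monitored, test fails
```

The precondition is never run for the unmonitored attribute,
which is the saving the change-set classification is meant to
provide.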

Once  invoked,  a precondition  procedure  uses  sequences 
of  associative  retrievals  and  structural  matches  on  portions  of 
the  blackboard  in  an  attempt  to  establish  a context  satisfying 
the preconditions of one or more of "its" knowledge sources;
any given precondition procedure may be responsible for
instantiating several (related) knowledge sources. Notice that
the data-directed nature of precondition evaluation and
knowledge-source activation is linked closely to the primitive
functions  that  are  able  to  modify  the  data  base,  for  it  is  only  at 
points  of  modification  that  a precondition  that  was  unsatisfied 
before  may  become  satisfied.  Hence,  data  base  modification 
routines  have  the  responsibility  (although  perhaps  indirectly)  of 
activating  the  precondition  evaluation  mechanism. 

Implementation  on  Parallel  Computers 

Because of the independence of KS's and their data-directed
activation, there is a great deal of potential parallelism
in this organization. Trends in computer architecture indicate
that large amounts of computing power will be economically
realized in asynchronous multiprocessor networks. Thus, the
implementation of such large AI programs on multiprocessors
becomes an attractive goal. There are, however, a set of issues
in such an implementation; most of these deal with interference
among KS's when they attempt simultaneously to access the
blackboard. Effective solutions to these problems have been
developed in the Hearsay II implementation; Lesser et al. (1974),
Lesser (1975), Fennell (1975), and Fennell and Lesser (1975)
describe these solutions.

Abstract - Hearsay II (HSII) is a system currently under
development at Carnegie-Mellon University to study the connected
speech understanding problem. It is similar to Hearsay I (HSI) in
that it is based on the hypothesize-and-test paradigm, using
cooperating independent knowledge sources communicating with each
other through a global data structure (blackboard). It differs in
the sense that many of the limitations and shortcomings of HSI are
resolved in HSII.

The main new features of the Hearsay II system structure are:
1) the representation of knowledge as self-activating, asynchronous,
parallel processes, 2) the representation of the partial analysis in a
generalized three-dimensional network (the dimensions being level
of representation (e.g., acoustic, phonetic, phonemic, lexical,
syntactic), time, and alternatives) with contextual and structural
support connections explicitly specified, 3) a convenient modular
structure for incorporating new knowledge into the system at any
level, and 4) a system structure suitable for execution on a parallel
processing system.

The main task domain under study is the retrieval of daily
wire-service news stories upon voice request by the user. The main
parametric representations used for this study are 1/3-octave
filter-bank and linear-predictive coding (LPC)-derived vocal tract
parameters [10], [11]. The acoustic segmentation and labeling
procedures are parameter-independent [7]. The acoustic, phonetic, and
phonological components [23] are feature-based rewriting rules which
transform the segmental units into higher level phonetic units. The
vocabulary size for the task is approximately 1200 words. This
vocabulary information is used to generate word-level hypotheses
from phonetic and surface-phonemic levels based on prosodic
(stress) information. The syntax for the task permits simple
English-like sentences and is used to generate hypotheses based on the
probability of occurrence of that grammatical construct [19]. The
semantic model is based on the news items of the day, analysis of the
conversation, and the presence of certain content words in the
partial analysis. This knowledge is to be represented as a production
system. The system is expected to be operational on a 16-processor
minicomputer system [3] being built at Carnegie-Mellon University.

This paper deals primarily with the issues of the system
organization of the HSII system.


INTRODUCTION

THE HEARSAY II (HSII) speech understanding system
is a successor to the Hearsay I (HSI) system [10],
[17]. HSII represents, in terms of both its system
organization and its speech knowledge, a significant increase in
sophistication and generality over HSI. The development
of HSII has been based on two years of experience with a
running version of HSI, a desire to exploit multiprocessor
and network computer architecture for efficient
implementation [ff], [4], [(>], and a desire to handle more
complex speech task domains (e.g., larger vocabularies, less
restricted grammars, and a more complete set of
knowledge sources including prosodics, user models, etc.).
While from a conceptual point of view HSII is a natural
extension of the framework that HSI posited for a speech
understanding system, it differs significantly in its design
and in its details of implementation.1

THE  HEARSAY  SYSTEM  MODEL 

HSI was based on the view that the inherently errorful
nature  of  connected  speech  processing  could  be  handled 
only  through  the  efficient  use  of  multiple,  diverse  sources  of 
knowledge  [Id],  [lo].  The  major  focus  of  the  design  of 
HSI  was  the  development  of  a framework  for  representing 
these  diverse  sources  of  knowledge  and  their  cooperation 
[IS].  This  framework  is  the  conceptual  legacy  which  forms 
the  basis  for  the  HSII  design. 

There are four dimensions along which knowledge
representation in HSI can be described: 1) function, 2)
structure, 3) cooperation, and 4) attention focusing.

The function of a knowledge source (KS) in HSI has
three aspects. The first is for the KS to know when it has
something useful to contribute, the second is to contribute
its knowledge through the mechanism of making a
hypothesis (guess) about some aspect of the speech utterance,
and  the  third  is  to  evaluate  the  contribution  of  other 
knowledge  sources,  i.e.,  to  verify  and  reorder  (or  reject) 
the  hypotheses  made  by  other  knowledge  sources.  Each  of 
these  aspects  of  a KS  is  carried  out  with  respect  to  a 
particular  context,  the  context  being  some  subset  of  the 
previously  generated  hypotheses.  Thus,  new  knowledge 
is  built  upon  the  educated  guesses  made  at  some  previous 
time  by  other  knowledge  sources. 

The structure of each knowledge source in HSI is
specified so that it is independent and separable from all other
KS's in the system. This permits the easy addition of new
types of KS's and replacement of KS's with alternative
versions of those KS's. Thus, the system structure can be
easily adapted to new speech task domains which have
KS's specific to that domain, and the contribution of a
particular KS to the total recognition effort can be more
easily evaluated.

1 Several other organizations for speech understanding systems
have been implemented over the past few years. Included among
the more interesting ones are the SDC system [2], [20], the Bolt
Beranek and Newman (BBN) SPEECHLIS system [21], [24], the
DRAGON system [1], and the IBM system (1*1.

The choice of a framework for cooperation among KS's
is intimately interwoven with the function and structure
of knowledge in HSI. The mechanism for KS cooperation
involves hypothesizing and testing (creating and evaluating)
hypotheses in a global data base (blackboard). The
generation and modification of globally accessible hypotheses
thus becomes the primary means of communication
between diverse KS's. This mechanism of cooperation allows
a KS to contribute knowledge without being aware of
which other KS's will use its knowledge or which KS
contributed the knowledge that it used. Thus, each KS
can be made independent and separable.

The global data base that KS's use for cooperation
contains many possible interpretations of the speech data.
Each of these interpretations represents a "limited"
context in which a KS can possibly contribute information
by proposing or validating hypotheses. Attention focusing
of a KS involves choosing which of these limited contexts
it will operate in and for how much processing time. The
attention focusing strategy is decoupled from the functions
of individual KS's. Thus, the decision of whether a KS
can contribute in a particular context is local to the KS,
while the assignment of that KS to one of the many
contexts on which it can possibly operate is made more
globally. This decoupling of focusing strategy from
knowledge acquisition, together with the decoupling of the data
environment (global data base) from control flow (KS
invocation) and the limited context in which a KS
operates, permits a quick refocusing of attention of KS's.
The ability to refocus quickly is very important in a
speech understanding system, because the errorful nature
of the speech data and its processing leads to many
potential interpretations of the speech. Thus, as soon as
possible after an interpretation no longer seems the most
promising, the activity of the system should be refocused
to the new most promising interpretation.

OVERVIEW  OF  HSI 

The following is a brief description of the HSI
implementation for this model of knowledge source
representation and cooperation. This description will then be used
to contrast the differences of implementation philosophy
between HSI and HSII. (More complete descriptions of
HSI are contained in [f>], [12], [10], [17].)

HSI Implementation Overview

The global data base of HSI consists of partial sentence
hypotheses, each being a sequence of words with
nonoverlapping time locations in the utterance. It is a partial
sentence hypothesis because not all of the utterance need
be described by the given sequence of words. In particular,
gaps in the knowledge of the utterance are designated by
"filler" words. The partial sentence hypotheses also
contain confidence ratings for each word hypothesis and a
composite rating for the overall sequence of words. A
sentence hypothesis is the focal point that is used to invoke
a knowledge source. The sentence hypothesis also contains
the accumulation of all information that any KS has
contributed to that hypothesis.

KS's are invoked in a lockstep sequence consisting of
three phases: poll, hypothesize, and test. At each phase, all
KS's are invoked for that phase, and the next phase does
not commence until all KS's have completed the current
one. The poll phase involves determining which KS's
have something to contribute to the sentence hypothesis
which is currently being focused upon; polling also
determines how confident each KS is about its proposed
contributions. The hypothesize phase consists of invoking
the KS showing the most confidence about its proposed
contribution of knowledge. This KS then hypothesizes
a set of possible words (option words) for some (one)
filler word in the speech utterance. The testing phase
consists of each KS evaluating (verifying) the possible
option words with respect to the given context. After all
KS's have completed their verifications, the option words
which seem most likely, based on the combined ratings
of all the KS's, are then used to construct new partial
sentence hypotheses. The global data base is then
re-evaluated to find the most promising sentence hypothesis;
this hypothesis then becomes the focal point for the next
hypothesize-and-test cycle.
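
One pass of the poll, hypothesize, and test cycle can be sketched
as below. This is a simplified illustration, not the HSI code: the
KS class, the fixed confidences, the default rating of 0.5, and
the use of a product to combine ratings are all assumptions, and
only the single best option word is kept.

```python
FILLER = "<filler>"


class KS:
    """Toy knowledge source with canned answers (hypothetical)."""

    def __init__(self, name, confidence, options, ratings):
        self.name = name
        self.confidence = confidence
        self.options = options
        self.ratings = ratings

    def poll(self, sentence):
        return self.confidence

    def hypothesize(self, sentence):
        return self.options

    def test(self, sentence, word):
        return self.ratings.get(word, 0.5)  # assumed neutral default


def hsi_cycle(kss, sentence):
    # Phase 1 (poll): every KS reports its confidence.
    confidences = [ks.poll(sentence) for ks in kss]
    # Phase 2 (hypothesize): only the most confident KS proposes words.
    best = kss[confidences.index(max(confidences))]
    options = best.hypothesize(sentence)
    # Phase 3 (test): every KS rates every option word.
    combined = {}
    for word in options:
        rating = 1.0
        for ks in kss:
            rating *= ks.test(sentence, word)  # assumed combination rule
        combined[word] = rating
    winner = max(combined, key=combined.get)
    # Build a new partial sentence hypothesis with one filler replaced.
    new_sentence = list(sentence)
    new_sentence[new_sentence.index(FILLER)] = winner
    return new_sentence


acoustics = KS("acoustics", 0.4, ["night", "knight"],
               {"night": 0.6, "knight": 0.6})
syntax = KS("syntax", 0.7, ["night", "light"],
            {"night": 0.9, "light": 0.3})
result = hsi_cycle([acoustics, syntax], ["day", "and", FILLER])
```

Each phase here finishes for all KS's before the next begins,
which is the lockstep property discussed in the limitations that
follow.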

HSI Performance

The HSI system first demonstrated live,
connected-speech recognition publicly in June 1972. Since that time,
about three man-years have been spent in improving its
performance. The system has been tested on a set of 144
connected speech utterances, containing 070 word tokens,
spoken by five speakers, and consisting of four tasks, with
vocabularies ranging from 28 to 70 words. On the average,
the system locates and correctly identifies about 93
percent of the words, using all of its KS's. Without the use
of the Semantics KS, the accuracy decreases to 70
percent. It decreases further to about 39 percent when neither
Syntax nor Semantics is used. A more complete
performance summary may be found in the Appendix.

HSI Design Limitations

There are four major design decisions in the HSI
implementation of knowledge representation and cooperation
which make it difficult to apply HSI to more complex
speech tasks or multiprocessor environments.

The first, and most important, of these limiting
decisions concerns the use of the hypothesize-and-test
paradigm. As implemented in HSI, the paradigm is exploited
only at the word level. The implication of hypothesizing
and testing at only the word level is that the knowledge
representation is uniform only with respect to
cooperation at that level. That is, the information content of any
element in the global data base is limited to a description
at the word level. The addition of nonword level KS's
(i.e., KS's cooperating via either subword levels, such as
syllables or phones, or via supraword levels, such as phrases
or concepts) thus becomes cumbersome because this
knowledge must somehow be related to hypothesizing and
testing at the word level. This approach to nonword level
KS's makes it difficult to add nonword knowledge and
to evaluate the contribution of this knowledge. In
addition, the inability to share nonword level information
among KS's causes such information to be recomputed
by each KS that needs it.

Second, HSI constrains the hypothesize-and-test
paradigm to operate in a lockstep control sequence. The
effect of this decision is to limit parallelism because the
time required to complete a hypothesize-and-test cycle
is the maximum time required by any single hypothesizer
KS plus the maximum time required by any single verifier
(testing) KS. Another disadvantage of this control scheme
is that it increases the time it takes the system to refocus
attention, because there is no provision for any
communication of partial results among KS's. Thus, for example, a
rejection of a particular option word by a KS will not be
noticed until all the KS's have tested all the option words.
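
The cycle-time statement above amounts to a simple formula, shown
here as a small sketch. The function names and the example timings
are invented for illustration, and the idle-time measure assumes
one processor per verifier KS, which the text does not state.

```python
def lockstep_cycle_time(hypothesizer_times, verifier_times):
    """Cycle time = slowest hypothesizer KS + slowest verifier KS."""
    return max(hypothesizer_times) + max(verifier_times)


def verifier_idle_time(verifier_times):
    """Total time verifiers spend waiting for the slowest one
    (assuming each verifier runs on its own processor)."""
    slowest = max(verifier_times)
    return sum(slowest - t for t in verifier_times)


cycle = lockstep_cycle_time([2, 5], [1, 4, 3])
idle = verifier_idle_time([1, 4, 3])
```

With these made-up timings every cycle costs 9 units regardless
of how quickly the faster KS's finish, and the faster verifiers
wait a combined 4 units before the next phase can begin.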

The third weakness in the HSI implementation concerns
the structure of the global data base: there is no provision
for specifying relationships among alternative sentence
hypotheses. The absence of relational structures among
hypotheses has the effect of increasing the overall
computation time and increasing the time to refocus attention,
because the information gained by working on one
hypothesis cannot be shared by propagating it to other
relevant hypotheses.

The fourth limiting design decision relates to how a
global problem-solving strategy (policy) is implemented
in HSI: policy decisions, such as those involving attention
focusing, are centralized (in a "Recognition Overlord"),
and there is no coherent structure for the policy algorithms.
The effect of having no explicit system structure for
implementing policy decisions makes it very awkward to
add or delete new policy algorithms and difficult to analyze
the effectiveness of a policy and its interaction with other
policies.

OVERVIEW OF HSII

Experience with HSI (as described above) has led to
several important observations about a more general,
uniform, and natural structure for representing and
operating on the (dynamic) state of the utterance recognition.2

1) The information contained in hypotheses at different
levels of knowledge representation may be encoded in
essentially identical internal structures, except for the
primitive unit of information held in a hypothesis. This
structural homogeneity in the global data base allows the
actions of hypothesizing and testing at these various levels
to be treated in a uniform manner.

2) The different types of knowledge (and their
relationships) present in speech may be naturally represented in a
single, uniform data structure. This data structure is
three-dimensional: one dimension represents information
levels (e.g., phrasal, lexical, phonetic), the second
represents speech time, and the third dimension contains
alternative (competing) hypotheses at a particular level and
time. These three dimensions form a convenient
addressing structure for locating hypotheses.

3) There is a conceptually simple and uniform way of
dynamically relating hypotheses at one level of
knowledge to alternative hypotheses at that level and to
hypotheses at other knowledge levels in the structure. The
resulting structure is an AND/OR graph with modifications
which provide for temporal relationships and selective
dependency relationships.3

2 The meaning of these observations will be made more clear by
the further descriptions that follow.
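
The three-dimensional addressing structure of observation 2 can
be sketched as a lookup keyed by level and time that returns the
competing alternatives. The class name, the half-open time
intervals, and the arbitrary time units are assumptions for
illustration only.

```python
from collections import defaultdict


class HypothesisStore:
    """Toy data base addressed by (level, time); illustrative only."""

    def __init__(self):
        # level -> list of (begin, end, label); the list holds the
        # alternatives dimension.
        self.by_level = defaultdict(list)

    def add(self, level, begin, end, label):
        self.by_level[level].append((begin, end, label))

    def alternatives(self, level, time):
        """All hypotheses at `level` whose interval covers `time`."""
        return [label for begin, end, label in self.by_level[level]
                if begin <= time < end]


store = HypothesisStore()
store.add("lexical", 0, 30, "day")
store.add("lexical", 30, 45, "and")
store.add("lexical", 45, 80, "night")
store.add("lexical", 45, 80, "knight")   # competing alternative
store.add("phonetic", 45, 50, "N")
```

A query such as `store.alternatives("lexical", 60)` then returns
both competing word hypotheses for that stretch of the utterance.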


System  Structure 

The main goal of the HSII design is to extend the
concepts developed in HSI for the representation and
cooperation of knowledge at the word level to all levels of knowledge
needed in a speech understanding system, based on the
preceding observations.

The generalization of the hypothesize-and-test paradigm
to all levels of speech knowledge implies the need for a
mechanism for transferring information among levels.
This mechanism is already embodied in the
hypothesize-and-test paradigm; that is, one can characterize two types
of hypothesization a knowledge source might be called
upon to perform: horizontal and vertical hypothesization.
A hypothesization is horizontal when a KS uses contextual
information at a given knowledge level to predict new
hypotheses at the same level (e.g., the hypothesization
that the word "night" might follow the sequence of words
"day"-"and"), whereas a hypothesization is vertical when
a KS uses information at one level in the data base to
predict new hypotheses at a different level (e.g., the
generation of a hypothesis that a [T] occurred when a
segment of silence is followed by a segment of aspiration).

The HSII implementation of the hypothesize-and-test
paradigm has also resulted in a generalization of the
lockstep control scheme for KS sequencing employed by HSI.
HSII relaxes the constraints on the hypothesize-and-test
paradigm and allows the KS processes to run in an
asynchronous, data-directed manner. A KS is instantiated as a
KS process whenever the data base exhibits characteristics
which satisfy a "precondition" of the KS. A precondition
of a KS is a description of some partial state of the data
base which defines when and where the KS can contribute
its knowledge by modifying the data base. Such a
modification might be adding new hypotheses proposed by
the KS (at the information level appropriate for that KS)
or verifying (criticizing) hypotheses which already exist.
The modifications made by any given KS process are
expected to trigger further KS's by creating new conditions
in the data base to which those KS's respond. The
structure of a hypothesis has been so designed as to allow the
preconditions of most KS's to be sensitive to a single,
simple change in some hypothesis (such as the changing
of a rating or the creation of a structural link). Through
this data-directed interpretation of the
hypothesize-and-test paradigm, HSII KS's exhibit a high degree of
asynchrony and potential parallelism. A side-effect of this
more general control scheme for HSII is that the overall
problem-solving strategy need not be centralized and
implemented as a monolithic overlord, but rather can be
implemented as policy modules which operate in precisely
the same manner as KS's.

3 This latter feature refers to "connection matrices" and is
described below in more detail.

The  three-dimensional  data  base,  augmented  by  the 
and/oh  structural  relationships  specified  over  that  data 
base,  permits  information  generated  by  one  KS  to  be: 

1)  retained  for  use  by  other  KS’s,  and  2)  quickie  prop- 
agated to  other  relevant  parts  of  the  data  base.  This 
retention  and  propagation  provide  two  important  features 
for  solving  a complex  problem  in  which  errors  are  highly 
likely.  First,  quick  refocusing  can  occur  when  a particular 
path  no  longer  appears  promising.  Second,  “selective” 
backtracking  may  be  used;  i.e.,  when  a KS  finds  that  it 
has  made  an  incorrect  decision,  it  does  not  have  to  elim- 
inate all  information  generated  subsequent  to  that  de- 
cision, but  rather  only  that  subset  which  depends  on  the 
incorrect  decision.  In  this  way,  information  generated  by 
one  knowledge  source  is  retained  and  is  usable  by  itself 
and  other  KS’s  in  other  relevant  contexts. 
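
The idea of “selective” backtracking can be sketched as dependency-directed retraction. This is only an illustration under assumed names (`DependencyStore`, `retract`); the text does not say that HSII records dependencies in this form.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


class DependencyStore:
    """Records which items were derived from which decisions (hypothetical)."""

    def __init__(self) -> None:
        self.items: Set[str] = set()
        self.dependents: Dict[str, Set[str]] = defaultdict(set)

    def add(self, item: str, depends_on: Iterable[str] = ()) -> None:
        self.items.add(item)
        for parent in depends_on:
            self.dependents[parent].add(item)

    def retract(self, decision: str) -> Set[str]:
        """Remove a decision and only the items that depend on it."""
        removed: Set[str] = set()
        stack = [decision]
        while stack:
            current = stack.pop()
            if current in removed:
                continue
            removed.add(current)
            stack.extend(self.dependents.pop(current, ()))
        self.items -= removed
        return removed


store = DependencyStore()
store.add("A")            # an early decision
store.add("B", ["A"])     # derived from A
store.add("C")            # made after A, but independent of it
store.add("D", ["B"])     # derived indirectly from A
removed = store.retract("A")
```

Item "C" survives the retraction even though it was generated after "A", which is the difference from chronological backtracking.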

Summarizing,  HSII  is  based  on  the  views:  1)  that  the 
state  of  the  recognition  can  be  represented  in  a uniform, 
multilevel  data  base,  and  2)  that  speech  knowledge  can 
be  characterized  in  a natural  manner  by  describing  many 
small  KS’s.  These  KS’s  react  to  certain  states  of  the  data 
base  (via  their  preconditions)  and,  once  instantiated  as 
KS  processes,  provide  their  own  changes  to  the  data  base 
which  contribute  to  the  progress  of  the  recognition.  The 
hypothesize-and-test  paradigm,  when  stated  in  sufficiently 
nonrestrictive (parallel) terms, serves to describe the general
interactions among these KS’s. In particular, changes
made  by  one  or  more  KS  processes  may  trigger  other 
KS’s to react to these changes by validating (testing)
them  or  hypothesizing  further  changes.  The  intent  of 
HSII  is  to  provide  a framework  within  which  to  explore 
various  configurations  of  information  levels,  KS’s,  and 
global  strategies.4 

From a more general point of view, the goal of HSII is
to provide a multiprocess-oriented software architecture
to serve as a basis for systems of cooperating (but independent
and asynchronous) data-directed KS processes.
The purpose of such a structure is to achieve effective
parallel search over a general artificial intelligence problem-solving
graph, employing the hypothesize-and-test
paradigm to generate the search graph and using a uniform,
interconnected, multilevel global data base as the
primary means of interprocess communication.

4 It is interesting to note that this generalized form of hypothesize-and-test
leads to a system organization with some characteristics
similar to QA4 [22] and PLANNER [8]. In particular, there are
strong similarities in the data-directed sequencing of processes.

HSII SYSTEM DESIGN AND
IMPLEMENTATION

One can derive from the description of the desired HSII
recognition  process  given  above  several  basic  components 
of  the  required  system  structure.  First,  a sufficiently 
general  structured  global  data  base  is  needed,  through  which 
the  KS’s  may  communicate  by  inserting  hypotheses  and 
by  inspecting  and  modifying  the  hypotheses  placed  there 
by  other  KS’s.  Second,  some  means  for  describing  the 
various  KS’s  and  their  internal  processing  capabilities 
is  required.  Third,  in  order  to  have  knowledge  sources 
activated  in  a data-directed  manner,  a method  is  required 
by which a set of preconditions may be specified and associated
with each KS. Fourth, in order to detect the
satisfaction  of  these  preconditions  and  in  order  to  allow 
KS’s  to  locate  parts  of  the  data  base  in  which  they  are 
interested, two mechanisms are needed: 1) a monitoring
mechanism to record where in the data base changes have
occurred and the nature of those changes, and 2) an
associative retrieval mechanism for accessing parts of the
data base which conform to particular patterns specified
by matching prototypes.

Elements  of  the  System  Structure 

The following sections outline the HSII implementation
of  the  various  basic  system  components. 

Global Data Base: The design of HSII is centered around
a global  data  base  (blackboard)  which  is  accessible  to  all 
KS  processes.  The  global  data  base  is  structured  as  a 
uniform,  multilevel,  interconnected  data  structure. 

Each  level  in  the  data  base  contains  a (potentially 
complete)  representation  of  the  utterance;  the  levels  are 
differentiated  by  the  units  that  make  up  the  representa- 
tion, e.g.,  phrases,  words,  phonemes.  The  system  structure 
of  HSII  does  not  prespecify  what  the  levels  in  the  global 
data structure are to be. A particular configuration, called
HSII-C0 (Configuration Zero), is being implemented as
the first test of the HSII structure. Fig. 1 shows a schematic
of the levels of HSII-C0. A more detailed description
and justification for parts of this particular configuration
can be found in [23]. This configuration will be used
as  the  basis  for  examples  to  illustrate  various  aspects  of 
the  HSII  system. 

Parametric  Level:  The  parametric  level  holds  the  most 
basic  representation  of  the  utterance  that  the  system  has; 
it  is  the  only  direct  input  to  the  machine  about  the  acous- 
tic signal.  Several  different  sets  of  parameters  are  being 
used in HSII-C0 interchangeably: 1/3-octave filter-band
energies measured every 10 ms, LPC-derived vocal-tract
parameters [10], and wide-band energies and zero-crossing
counts.

Segmental Level: This level represents the utterance as
labeled acoustic segments. Although the set of labels may
be phonetic-like, the level is not intended to be phonetic;
the segmentation and labeling reflect acoustic manifestation
and do not, for example, attempt to compensate for
the context of the segments or attempt to combine acoustically
dissimilar segments into (phonetic) units.

As  with  all  levels,  any  particular  portion  of  the  utterance 
may  be  represented  by  more  than  one  competing  hy- 
pothesis (i.e.,  multiple  segmentations  and  labelings  may 
co-exist). 

Phonetic  Level:  At  this  level,  the  utterance  is  repre- 
sented by  a phonetic  description.  This  is  a broad  phonetic 
description  in  that  the  size  (duration)  of  the  units  is  on 
the order of the “size” of phonemes; it is a fine phonetic
description to the extent that each element is labeled
with a fairly detailed allophonic classification (e.g.,
“stressed, nasalized [I]”).

Surface-Phonemic  Level:  This  level,  named  by  seemingly 
contradicting  terms,  represents  the  utterance  by  phoneme- 
like units,  with  the  addition  of  modifiers  such  as  stress 
and  boundary  (word,  morpheme,  syllable)  markings. 

Syllabic  Level:  The  unit  of  representation  here  is  the 
syllable. 

Lexical  Level:  The  unit  of  information  at  this  level  is 
the word. (Note again that at any level competing representations
can be accommodated.)

Phrasal  Level:  Syntactic  elements  appear  at  this  level. 
In  fact,  since  a level  may  contain  arbitrarily  many  “sub- 
levels”  of  elements  structured  as  a modified  and/or 
graph,  traditional  kinds  of  syntactic  trees  can  be  directly 
represented  here. 

Conceptual  Level:  The  units  at  this  level  are  “concepts.” 
As  with  the  phrasal  level,  it  may  be  appropriate  to  use  the 
graph  structure  of  the  data  base  to  indicate  relationships 
among  different  concepts. 

The basic unit in the data structure is a node; a node
represents the hypothesis that a particular element exists
in the utterance. For example, a hypothesis at the phonetic
level may be labeled as “T.” Besides containing
the hypothesis element name, a node holds several other
kinds of information, including: 1) a correlation with a
particular time period in the speech utterance,5 2) scheduling
parameters (validity ratings, attention-focusing factors,
measures of computing effort expended, etc.), and 3) connection
information which relates the node to other nodes
in the data base.

5 This time period is specified with an explicit estimation of fuzziness,
even to the extent of spanning the entire utterance. As an
extreme example, if a sequence of “phonemes” is being hypothesized
from a “word” at the lexical level, the time boundaries of the phonemes
are specified as “fuzzy” if information regarding their actual
locations is not yet available.

Structural  relationships  between  nodes  (hypotheses) 
are  represented  through  the  use  of  links;  links  provide  a 
means  for  specifying  contextual  abstractions  about  the 
relationships  of  hypotheses.  A link  is  an  element  in  the 
data  structure  which  associates  two  nodes  as  an  ordered 
pair; one of the nodes is termed the upper hypothesis, and
the other is called the lower hypothesis. The lower hypothesis
is said to support the upper hypothesis; the upper hypothesis
is called a use of the lower one. There are several
types of links; in general, if a node serves as the upper
hypothesis for more than one link, all of those links must
be of the same type. Thus, one can talk of the “type of the
hypothesis,” which is the same as the type of all of its
lower links. The two most important structural relationship
types are sequence and option.

A sequence  node  is  a hypothesis  that  is  supported  by 
a (timewise)  sequential  set  of  hypotheses  at  a lower  level 
(or  sublevel— see  below).  For  example,  Fig.  2(a)  shows  a 
hypothesis  of  “will”  at  the  lexical  level  supported  by  the 
(time-)ordered contiguous sequence of “W,” “IH,” and
“L,” at the surface-phonemic level. The interpretation
of  a sequence  node  is  that  all  of  the  lower  hypotheses 
must  be  valid  in  order  to  support  the  upper  hypothesis. 

An  option  node  is  a hypothesis  that  has  alternative 
supports  from  two  or  more  hypotheses,  each  of  which 
covers  essentially  the  same  time  period.  For  example, 
Fig.  2(b)  shows  the  hypothesis  of  “noun”  at  the  phrasal 
level  as  being  supported  by  any  of  “boy,”  “toy,”  or  “tie,” 
all  of  which  are  competing  word  hypotheses  covering 
(approximately)  the  same  time  area.6 
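
The difference between sequence and option nodes can be sketched as follows. This is a simplification under stated assumptions: the `Node` class and its fields are hypothetical, and a boolean `valid` flag stands in for the validity ratings that HSII nodes actually carry.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    level: str
    kind: str = "leaf"      # "leaf", "sequence", or "option"
    valid: bool = True      # only consulted for leaf nodes
    supports: List["Node"] = field(default_factory=list)  # lower hypotheses

    def is_supported(self) -> bool:
        if self.kind == "sequence":   # every lower hypothesis must hold
            return all(s.is_supported() for s in self.supports)
        if self.kind == "option":     # any one alternative suffices
            return any(s.is_supported() for s in self.supports)
        return self.valid


w, ih, l = (Node(n, "surface-phonemic") for n in ("W", "IH", "L"))
will = Node("will", "lexical", "sequence", supports=[w, ih, l])

boy, toy, tie = (Node(n, "lexical") for n in ("boy", "toy", "tie"))
noun = Node("noun", "phrasal", "option", supports=[boy, toy, tie])

ih.valid = False                # one phoneme of "will" is rejected
boy.valid = toy.valid = False   # two of the three alternatives are rejected
```

Rejecting a single phoneme removes support for the sequence node "will", while the option node "noun" remains supported as long as one alternative ("tie") survives.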

Fig. 3 is an example of a larger fragment of the global
data structure. The level of a hypothesis is indicated by
its vertical position; the names of the levels are given on
the  left.  Time  location  is  approximately  indicated  by 
horizontal  placement,  but  duration  is  only  very  roughly 
indicated  (e.g.,  the  boxes  surrounding  the  two  hypotheses 
at  the  phrasal  level  should  be  much  wider).  Alternatives 
are  indicated  by  proximity;  for  example,  “will”  and 
“would”  are  word  hypotheses  covering  the  same  time 
span. Likewise, “question” and “modal-question,” “you1”
and “you2,” and “J” and “Y” all represent pairs of alternatives.

This  example  illustrates  several  features  of  the  data 
structure. 

The hypothesis “you,” at the lexical level, has two
alternative phonemic “spellings” indicated; the hypotheses
labeled “you1” and “you2” are nodes created, also at
the lexical level, to hold those alternatives. In general,
such sublevels may be created arbitrarily.

6 In addition to sequence and option there are two kinds of
structural relationships which are generalizations of them. An AND
node is similar to a sequence node except that there is no implication
of any time sequentiality among the supports; they may
overlap or be disjoint. An OR node is similar to an option node in
that the supports represent alternatives, but (as with the AND node)
there is no time requirement.

The link between “you1” and “D” is a special kind of
sequence link (indicated here by a dashed line) called a
context link; a context link indicates that the lower
hypothesis supports the upper one and is contiguous to
its brother links, but it is not “part of” the upper hypothesis
in the sense that it is not within the time interval of the
upper hypothesis; rather, it supplies a context for its
brother(s). In this case, one may “read” the structure as
stating “‘you1’ is composed of ‘J’ followed by ‘AX’
(schwa) in the context of the preceding ‘D.’” (This
reflects the phonological rule that “would you” is often
spoken as “would-ja.”) Thus, a context link allows important
contextual relationships to be represented without
violating the implicit time assumptions about sequence
nodes.

Whereas the phonemic spelling of the word “you” held
by “you1” includes a contextual constraint, the “you2”
option does not have this constraint. However, “you1”
and “you2” are such similar hypotheses that there is
strong reason for wanting to retain them as alternative
options under “you” (as indicated in Fig. 3), rather than
representing them unconnectedly. In general, the problem
is that the use of a hypothesis implies certain contextual
assumptions about its environment, while the support of
a hypothesis may itself be predicated on a particular set
of contextual assumptions. A mechanism, called a connection
matrix, exists in the data structure to represent
this kind of relationship by specifying, for an option
hypothesis, which of its alternative supports are applicable
(“connected”) to which of its uses. In this example, the
connection matrix of “you” (symbolized in Fig. 3 by the
two-dimensional binary matrix in the node) specifies
that support “you1” is relevant to use “question” (but
not to “modal-question”) and that support “you2” is
relevant to both uses. The use of a connection matrix
allows the efforts of preceding KS decisions to be accumulated
for future use and modification without necessitating
contextual duplication of parts of the data base.
Thus, much of the duplication of effort due to the backtracking
mode of HSI is avoided in HSII.
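
The connection matrix of “you” in this example can be written out as a small binary table. The row/column orientation and the helper names `uses_of` and `supports_of` are assumptions made for illustration; the text specifies only which supports are connected to which uses.

```python
from typing import List

# Rows are supports of the option hypothesis "you"; columns are its uses.
supports = ["you1", "you2"]
uses = ["question", "modal-question"]
connection_matrix = [
    [1, 0],   # you1: relevant to "question" only
    [1, 1],   # you2: relevant to both uses
]


def uses_of(support: str) -> List[str]:
    """Uses to which a given support of "you" is connected."""
    row = connection_matrix[supports.index(support)]
    return [use for use, bit in zip(uses, row) if bit]


def supports_of(use: str) -> List[str]:
    """Supports of "you" that are applicable to a given use."""
    col = uses.index(use)
    return [s for s, row in zip(supports, connection_matrix) if row[col]]
```

Both alternatives stay under the single node “you”, and the matrix alone records that the contextually constrained spelling applies to only one of the two uses.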

Besides showing structural relationships (i.e., that one
unit is composed of several other units), a link is a statement
about the degree to which one hypothesis implies
(i.e., gives evidence for the existence of) another hypothesis.
The strength of the implication is held as information
on the link. The sense of the implication may be negative;
that is, a link may indicate that one hypothesis is evidence
for the invalidity of another. Finally, this statement of
implication is bidirectional; the existence of the upper
hypothesis may give credence to the existence of the lower
hypothesis and vice versa.

The  nature  of  the  implications  represented  by  the  links 
provides  a uniform  basis  for  propagating  changes  made  in 
one  part  of  the  data  structure  to  other  relevant  parts 
without  necessarily  requiring  the  intervention  of  particu- 
lar KS’s  at  each  step.  Considering  the  example  of  Fig.  3, 
assume that the validity of the hypothesis labeled “J” is
modified by some KS (presumably operating at the phonetic
level) and becomes very low. One possible scenario for
rippling this change through the data base is the following.

First, the estimated validity of “you1” is reduced,
because “J” is a lower hypothesis of “you1.”

This, in turn, may cause the rating of “you” to be reduced.

The connection matrix at “you” specifies that “you1”
is not relevant to “modal-question,” so the latter hypothesis
is not affected by the change in rating of the former.
Notice that the existence of the connection matrix allows
this decision to be made locally in the data structure,
without having to search back down to the “D” and “J.”

“Question,” however, is supported by “you1” (through
the connection matrix at “you”), so its rating is affected.

Further propagations can continue to occur, perhaps
down the other sequence links under “question” and
“you1.”
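
The scenario above can be traced in a small sketch. The numeric ratings are made up, and the rule that a sequence is rated by its weakest support is an assumption chosen only to make the ripple visible; the text does not state how HSII combines ratings.

```python
from typing import Dict, List

# Illustrative ratings on a 0-100 scale; "J" has just been lowered by some KS.
ratings: Dict[str, int] = {"D": 80, "J": 10, "AX": 70}

you1_supports = ["D", "J", "AX"]   # sequence under "you1" ("D" is context)

# Connection matrix at "you": which supports are relevant to which use.
connected: Dict[str, List[str]] = {
    "question": ["you1", "you2"],
    "modal-question": ["you2"],
}


def rate_sequence(supports: List[str]) -> int:
    """Assumed rule: a sequence is only as good as its weakest support."""
    return min(ratings[s] for s in supports)


def affected_uses(changed_support: str) -> List[str]:
    """Uses that must be re-rated when one support of "you" changes."""
    return [use for use, sups in connected.items() if changed_support in sups]


ratings["you1"] = rate_sequence(you1_supports)   # "J" lowers "you1"
to_rerate = affected_uses("you1")                # decided locally at "you"
```

The decision to leave “modal-question” alone is made from the matrix at “you” without revisiting “D” or “J”.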

KS  Specification:  A knowledge  source  is  specified  in 
three parts: 1) the conditions under which it is to be activated
(in terms of the data base conditions in which it is
interested, as described in the section on “preconditions”
below), 2) the kinds of changes it makes to the global data
base, and 3) a procedural statement (program) of the
algorithm  which  accomplishes  those  changes.  A KS  is 
thus  defined  as  possessing  some  processing  capability  which 
is  able  to  solve  some  subproblem,  given  appropriate  cir- 
cumstances for  its  activation. 

The decomposition of the overall recognition task into
various knowledge sources is regarded as being a natural
decomposition. That is, the units of the decomposition
represent those pieces of knowledge which can be distinguished
and recognized as being somehow naturally independent.7
Given a sufficient set of such KS’s (that is, a
set that provides enough overall connectivity among the
various levels of the data base that a final recognition can
be attained), the collection is called the “overall recognition
system.” Such a scheme of “inverse decomposition”
(or, composition) seems very natural for many problem-solving
tasks, and it fits well into the hypothesize-and-test
approach to problem-solving. As long as a sufficient “covering
set” for the data base connections is maintained, one
can freely add new KS’s, or replace or delete old ones.
Each KS is in some sense self-contained, but each is expected
to cooperate with the other KS’s that happen to
be present in the system at that time.

As examples of KS’s, Fig. 4 shows the first set of processes
implemented for HSII-C0. The levels are indicated
as horizontal lines in the figure and are labeled at the left.
The KS’s are indicated by arcs connecting levels; the
starting point(s) of an arc indicates the level(s) of major
“input” for the KS, and the end point indicates the “output”
level where the KS’s major actions occur. In general,
the action of most of these particular KS’s is to create
links between hypotheses on its input level(s) and: 1)
existing hypotheses on its output level, if appropriate
ones are already there, or 2) hypotheses that it creates on
its output level.

The Segmenter-Classifier KS uses the description of the
speech signal to produce a labeled acoustic segmentation.
(See [7] for a description of the algorithm being used.)
For  any  portion  of  the  utterance,  several  possible  alter- 
native segmentations  and  labels  may  be  produced. 

The  Phone  Synthesizer  uses  labeled  acoustic  segments  to 
generate  elements  at  the  phonetic  level.  This  procedure  is 
sometimes  a fairly  direct  renaming  of  an  hypothesis  at 
the  segmental  level,  perhaps  using  the  context  of  adjacent 
segments.  In  other  cases,  phone  synthesis  requires  the 
combining  of  several  segments  (e.g.,  the  generation  of 
[t]  from  a segment  of  silence  followed  by  a segment  of  as- 
piration) or  the  insertion  of  phones  not  indicated  directly 
by  the  segmentation  (e.g.,  hypothesizing  the  existence  of 
an [l] if a vowel seems velarized and there is no [l] in the
neighborhood).  This  KS  is  triggered  whenever  a new  hy- 
pothesis is  created  at  the  segmental  level. 

The  Word  Candidate  Generator  uses  phonetic  information 
(primarily  just  at  stressed  locations  and  other  areas  of 
high  phonetic  reliability)  to  generate  word  hypotheses. 
This  is  accomplished  in  a two-stage  process,  with  a stop  at 
the  syllabic  level,  from  which  lexical  retrieval  is  more 
effective. 


7 The  approach  taken  in  knowledge  source  decomposition  is  not  an 
attempt  to  characterize  somehow  the  overall  recognition  process  and 
then  apply  some  sort  of  traffic  flow  analysis  to  its  internal  workings 
in  order  to  decompose  the  total  process  into  minimally  interacting 
KS’s. Rather, KS’s are defined by starting with some intuitive
notion about the various pieces of knowledge which could be incorporated
in a useful way to help achieve a solution.

Fig. 4. First KS’s for HSII-C0.


The Syntactic Word Hypothesizer uses knowledge at the
phrasal level to predict possible new words at the lexical
level which are adjacent (left or right) to words previously
generated at the lexical level. ([10] contains a description
of the probabilistic syntax method being used here.) This
KS is activated at the beginning of an utterance recognition
attempt and, subsequently, whenever a new word is
created at the lexical level.

The Phoneme Hypothesizer KS is activated whenever a
word hypothesis is created (at the lexical level) which is
not yet supported by hypotheses at the surface-phonemic
level. Its action is to create one or more sequences at the
surface-phonemic level which represent alternative pronunciations
of the word. (These pronunciations are
currently prespecified as entries in a dictionary.)

The Phone-Phoneme Synchronizer is triggered whenever
a hypothesis is created at either the phonetic or the
surface-phonemic level. This KS attempts to link up the
new hypothesis with hypotheses at the other level. This
linking may be many-to-one in either direction.

The  Syntactic  Parser  uses  a syntactic  definition  of  the 
input language to determine if a complete sentence may
be  assembled  from  words  at  the  lexical  level. 

Fig. 5 shows the initial HSII-C0 KS’s of Fig. 4 augmented
with four additional ones which are being implemented
or are planned.

The Semantic Word Hypothesizer uses semantic and
pragmatic  information  about  the  task  (news  retrieval, 
in  this  case)  to  predict  words  at  the  lexical  level. 

The Phonological Rule Applier rewrites sequences at
the surface-phonemic level. This KS is used: 1) to augment
the dictionary lookup of the Phoneme Hypothesizer, and
2)  to  handle  word  boundary  conditions  that  can  be  pre- 
dicted by  rule. 

The primary duties of the Segment-Phone Synchronizer
and the Parameter-Segment Synchronizer are similar: to
recover from mistakes made by the (bottom-up) actions
of the Phone Synthesizer and Segmenter-Classifier, respectively,
by allowing feedback from the higher to the
lower level.

The  introduction  of  these  knowledge  sources  indicates 
the modularity and extendability of the system in terms
of  both  KS’s  and  levels.  In  particular,  notice  that  even 
though  the  purpose  of  some  KS  is  stated  as  “correcting 
errors produced by other KS’s,” each KS is independent of
the  others.  Yet  additional  KS’s  will  be  added  to  the  con- 
figuration as  the  need  for  them  is  seen  and  as  ideas  for 
their  implementation  are  developed. 

In  addition  to  the  KS  modules  described  above,  all  of 
which  embody  speech  knowledge,  several  policy  modules 
exist.  These  modules,  which  interface  to  the  system  in  a 
manner identical to the speech modules, execute policy
decisions,  e.g.,  propagation  of  ratings,  focusing  of  attention, 
and  scheduling  of  other  modules. 

Matching-Prototypes and Associative Retrieval: The
asynchronous processing activity in HSII is primarily
data-directed; this implies the need for some mechanism
whereby one can retrieve parts of the global data base
in an associative manner. HSII provides primitives for
associatively searching the data base for hypotheses satisfying
specified conditions (e.g., finding all hypotheses
at the phonetic level which contain a vowel within a certain
time range). The search condition is specified by a
matching prototype, which is a partial specification of the
components of a hypothesis. This partial specification
permits a component to be characterized by: 1) a set of
desired values, or 2) a don’t-care condition, or 3) values
of components of a hypothesis previously derived by
matching another prototype. A matching prototype can be
compared against a set of hypotheses. Those hypotheses
whose component values match those specified by the
matching prototype are returned as the result of the
search. Associative retrieval of structural relationships
among hypotheses is also provided by several primitives.


More complex retrievals can be accomplished by combining
the retrieval primitives in appropriate ways.
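
A matching prototype can be sketched as a partial description of a hypothesis. The dictionary representation, the `DONT_CARE` sentinel, and the functions `matches` and `retrieve` are hypothetical names chosen for this illustration, not the HSII primitives themselves.

```python
from typing import Any, Dict, List

DONT_CARE = object()   # sentinel for a component the prototype ignores
Hyp = Dict[str, Any]


def matches(hyp: Hyp, prototype: Dict[str, Any]) -> bool:
    """A component matches if it is DONT_CARE or lies in the desired values."""
    for component, desired in prototype.items():
        if desired is DONT_CARE:
            continue
        if hyp.get(component) not in desired:
            return False
    return True


def retrieve(hyps: List[Hyp], prototype: Dict[str, Any]) -> List[Hyp]:
    return [h for h in hyps if matches(h, prototype)]


data_base = [
    {"level": "phonetic", "label": "IH", "begin": 12, "end": 20},
    {"level": "phonetic", "label": "W", "begin": 5, "end": 12},
    {"level": "phonetic", "label": "AX", "begin": 40, "end": 46},
    {"level": "lexical", "label": "will", "begin": 5, "end": 30},
]

# Forms 1 and 2: desired value sets plus a don't-care component.
vowels = {"IH", "AX"}
proto = {"level": {"phonetic"}, "label": vowels,
         "begin": range(10, 31), "end": DONT_CARE}
found = retrieve(data_base, proto)

# Form 3: a component constrained by a value taken from a previous match.
words = retrieve(data_base,
                 {"level": {"lexical"}, "end": range(found[0]["end"], 31)})
```

The second retrieval shows how primitives can be combined: its time constraint is derived from the hypothesis returned by the first.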

Preconditions and Change Sets: Associated with every
KS is a specification of the data base conditions required
for the instantiation of that KS. Such specifications,
called preconditions, conceptually form an AND/OR tree
composed  of  matching  prototypes  and  structural  re- 
lationships which,  when  applied  to  the  data  base  in  an 
associative  manner,  detect  the  regions  of  the  data  base 
in  which  the  KS  is  interested  (if  the  precondition  is  cap- 
able of  being  satisfied  at  that  time).  Alternatively,  one 
might  think  of  the  precondition  specification  as  a proce- 
dure, involving  matching  prototypes  and  structural  re- 
lationships, which  effectively  evaluates  a conceptual 
AND/OR tree. This procedure may contain arbitrarily
complex  decisions  (based  on  current  and  past  modifica- 
tions to  the  data  base)  resulting  in  the  activation  of  de- 
sired KS’s  within  the  chosen  contexts.  The  context  cor- 
responding to  the  discovered  data  base  region  which 
satisfies  some  KS’s  precondition  is  used  as  an  initial  con- 
text in  which  to  instantiate  that  KS  as  a new  process. 
If there are multiple regions in the data base that satisfy
the  specified  conditions,  the  KS  can  be  separately  in- 
stantiated for  each  context,  or  once  with  a list  of  all  such 
contexts. 

Whenever any KS performs a modification to the global
data base, the essence of the modification is preserved in
a change set appropriate to the change made (e.g., one
change set records rating changes, while another records
time range changes). Change sets serve to categorize
data base modifications (events) and are thus useful in
precondition evaluation since they limit the areas in the
data  base  that  need  to  be  examined  in  detail.  As  currently 
implemented,  precondition  evaluators  exist  as  a set  of  pro- 
cedures which  monitor  changes  in  the  data  base  (i.e.,  they 
monitor  additions  to  those  change  sets  in  which  they  are 
interested).  These  precondition  procedures  are  themselves 
data-directed  in  that  they  are  applied  whenever  sufficient 
changes  have  been  made  in  the  global  data  base.  In  effect, 
the  precondition  procedures  themselves  have  precon- 
ditions, albeit  of  a much  simpler  form  than  those  possible 
for  KS’s.  For  example,  a precondition  procedure  may 
specify  that  it  should  be  invoked  (by  a system  precondition 
invoker)  whenever  changes  occur  to  two  adjacent  hy- 
potheses at  the  word  level  or  whenever  support  is  added  to 
the  phrasal  level.  By  using  the  (coarse)  classifications 
afforded  by  change  sets,  the  system  avoids  most  unneces- 
sary data  base  examinations  by  the  precondition  pro- 
cedures. The  major  point  to  be  made  is  that  the  scheme  of 
precondition  evaluation  is  event  driven,  being  based  on  the 
occurrence of changes in the global data base. In particular,
precondition evaluators are not involved in a form of
busy waiting in which they are constantly looking for
something  that  is  not  yet  there. 
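
The event-driven use of change sets can be sketched as follows. The class `ChangeSets` and its methods `watch` and `record` are assumed names; the sketch shows only the general idea that an evaluator runs when a change of a kind it monitors is recorded, and never polls.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Change = Tuple[str, str]   # (kind of change, hypothesis name)


class ChangeSets:
    """Routes each data base modification to the evaluators watching its kind."""

    def __init__(self) -> None:
        self.sets: Dict[str, List[Change]] = defaultdict(list)
        self.watchers: Dict[str, List[Callable[[Change], None]]] = defaultdict(list)

    def watch(self, kind: str, evaluator: Callable[[Change], None]) -> None:
        self.watchers[kind].append(evaluator)

    def record(self, kind: str, hypothesis: str) -> None:
        change = (kind, hypothesis)
        self.sets[kind].append(change)
        for evaluator in self.watchers[kind]:   # invoked only on a relevant event
            evaluator(change)


invocations: List[Change] = []
changes = ChangeSets()
changes.watch("rating", invocations.append)   # evaluator interested in ratings

changes.record("time-range", "will")   # nobody is watching: nothing runs
changes.record("rating", "you1")       # triggers the rating evaluator
```

Because the modification routine itself notifies the watchers, the evaluator does no work for the time-range change.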

Once  invoked,  a precondition  procedure  uses  sequences 
of  associative  retrievals  and  structural  matches  on  the 
data  base  in  an  attempt  to  establish  a context  satisfying 
the preconditions of one or more of “its” KS’s; any given
precondition procedure may be responsible for instantiating
several (usually related) KS’s. Notice that the data-directed
nature of precondition evaluation and KS instantiation
is linked closely to the primitive functions that
are able to modify the data base, for it is only at points of
modification that a precondition that was unsatisfied before
may become satisfied. Hence, data base modification
routines have the responsibility (although perhaps indirectly)
of activating the precondition evaluation mechanism.8

Multiprocessing Considerations: Some Problems
and Their Solutions

A primary  goal  in  the  design  of  HSII  is  to  exploit  the 
potential  parallelism  of  the  Hearsay  system  model  as 
fully  as  possible.  Several  issues  associated  with  the  in- 
troduction of  parallel  KS  processes  into  HSII  will  be 
described  and  their  current  solutions  outlined. 

Local  Contexts:  A precondition  evaluator  (process)  is 
invoked based on the occurrence of certain changes which
have taken place in the global data base since the last
time the evaluator was invoked; these changes, together
with the state of the relevant parts of the global data
base at the instant at which the precondition evaluator is
invoked, form a local context within which the evaluator
operates. Conceptually, at the instant of its invocation,
the precondition evaluator takes a snapshot of the global
data  base  and  saves  the  substance  of  the  changes  that  have 
occurred  to  that  moment,  thereby  preserving  its  local 
context. 

The necessity of preserving this local context exists
for several reasons: 1) HSII consists of asynchronous
processes sharing a common global data base which
contains only the most current data (that is, no history of
data modification is preserved in the global data base),
2) since the precondition evaluators are also executed
asynchronously, each evaluator is interested only in
changes in the data base which have occurred since the
last time that particular evaluator was executed (that is,
the relevant set of changes for a particular precondition
evaluator is time-dependent on the last execution of that
evaluator), and 3) further modifications to the global data
base which are of interest to a given KS process may occur
between the time of that KS process’s instantiation and
the time of its completion (in particular, such
modifications and their relationship to data base values
which existed at the time of the instantiation of the KS
process may influence the computation of that KS process).
Hence, the problem of creation of local contexts exists,
as do the associated problems of signaling a KS process
when its local context is no longer valid and of updating
these contexts as further changes occur in the global data
base.

8 One might think of HSII as a production system [14] which is
executed asynchronously. The preconditions correspond to the
left-hand sides (conditions) of productions, and the knowledge
sources correspond to the right-hand sides (actions) of the
productions. Conceptually, these left-hand sides are evaluated
continuously. When a precondition is satisfied, an instantiation
of the corresponding right-hand side of its production is created;
this instantiation is executed at some arbitrary subsequent time
(perhaps subject to instantiation scheduling constraints).

T1: precondition evaluator PRE
    (triggered by data base changes)
T2: PRE instantiates a knowledge source process KS
T3: end PRE
T4: start KS
T5: after KS revalidation of precondition
    <computation>
T6: KS modifies global data base
    <computation>
T7: KS modifies global data base again
T8: end KS

Fig. 6. Timing sequence of a precondition evaluator and KS.

Consider the time sequence of events shown in Fig. 6.
The precondition evaluator PRE is activated to respond
to changes occurring in the global data base. PRE should
execute in the context of changes existing at time T1
(since that context contains the changes which caused
PRE to be activated). KS is instantiated (readied for
running) at T2 due to further conditions PRE discovered
about the change context of T1. Hence, PRE should pass
the context of T1 as the initial environment in which to
run KS.

By time T4, when KS actually starts to execute, other
changes could have occurred in the global data base due
to the actions of other KS processes. So KS should
examine those new updating changes (those occurring
between T1 and T4) and revalidate its precondition, if
necessary. After revalidation, KS assumes the updated
context of T5, and it proceeds to base its computation on
the context of changes as of T5.

When KS wishes to perform an actual update of
elements of the global data base at T6, it must examine the
changes to the global data base that have occurred
between T5 and T6 to see if any other KS processes may
have violated KS’s preconditions, thereby invalidating its
computations. Having performed this revalidation and
any data base updating, KS should update its context to
reflect changes up to T6 for use in its further computation.
At T7, KS must look for further possible invalidations
to its most recent computations, due to possible changes
in the global data base by other KS processes during the
time period T6 to T7. When KS (which is an instantiation
of some KS) completes its actions at T8, its local context
may be deleted.
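
The revalidation discipline just described (check at T4 and T5,
then again before each update at T6 and T7) can be sketched in
Python as follows. This is a single-threaded illustration under
stated assumptions, not the HSII implementation: LocalContext,
still_valid, run_ks, and the interference argument (which
simulates another KS process acting between two steps) are all
hypothetical names.

```python
class LocalContext:
    """Change set accumulated for one process since it last looked."""

    def __init__(self):
        self.pending = []  # (node, field, old_value) records

    def record(self, node, field, old_value):
        self.pending.append((node, field, old_value))

    def drain(self):
        """Return and clear the changes received since the previous drain."""
        changes, self.pending = self.pending, []
        return changes


def still_valid(db, context, precondition):
    """Re-check the precondition only if relevant changes have arrived."""
    changes = context.drain()
    return not changes or precondition(db)


def run_ks(db, context, precondition, updates, interference=None):
    """Run one KS instantiation; return the keys it actually updated."""
    interference = interference or {}
    applied = []
    if not still_valid(db, context, precondition):      # T4 to T5
        return applied
    for step, (key, value) in enumerate(updates):       # T6, T7, ...
        if step in interference:
            interference[step](db, context)             # another KS acts
        if not still_valid(db, context, precondition):  # revalidate first
            break
        db[key] = value
        applied.append(key)
    return applied                                      # T8: context can go


def other_ks(db, context):
    """Simulated competitor: lowers the rating and reports the change."""
    context.record("w1", "rating", db[("w1", "rating")])
    db[("w1", "rating")] = 30


db = {("w1", "rating"): 70}
context = LocalContext()
updates = [(("p1", "words"), ["w1"]), (("p1", "rating"), 65)]
applied = run_ks(db, context, lambda d: d[("w1", "rating")] > 50,
                 updates, interference={1: other_ks})
print(applied)
```

In this run the first update is applied, the simulated competitor
then lowers the rating, and the second update is abandoned
because the precondition no longer holds.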

Changes occurring between instantiations of a KS are
accumulated in the local contexts of the precondition
evaluators and may become part of the local context of a
future instantiation of a KS. Thus, the precondition
evaluators are always collecting data base changes (since
these evaluators are permanently instantiated), while
individual KS instantiations accumulate data base changes
only during their transient existence.

In practice, to create local contexts one need only save
the contents of changes which occur in the global data
base (thereby allowing the global data base to contain
only the very latest values). In particular, no massive copy
operations involving the global data base are required.
Thus, for each data base event caused by a modification
primitive, the associated change9 is distributed (copied)
into the local contexts (which can now be characterized as
local change sets,10 referring to the previous discussion on
change sets) of all KS processes and precondition
evaluators who care. Notice that not every KS process and
precondition evaluator cares about every change that takes
place. For example, a fricative detector will not care about
a data base change associated with the grouping of several
words to form a syntactic phrase, but it is interested in the
hypothesization of a word whose phonemic spelling
contains a fricative.

In order for a KS (or precondition evaluator) process to
be notified of changes which occur to particular fields of
particular nodes which are in its local context, those fields
need to be tagged with an alias (called a tagid) belonging to
that KS instantiation. Then, whenever a modification is
made to the global data base, a message signaling the
change is sent to all who have tagged the field being
changed. In addition, the contents of the change are also
distributed to the local contexts of those KS’s receiving
the message. This data field tagging may be requested by
either a precondition evaluator which is about to
instantiate a KS based on the values of particular fields
(which represent the context satisfying the precondition),
or by a KS process, once instantiated, which may request
additional fields to be tagged.
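
A minimal Python sketch of field tagging and change distribution
follows. It is an illustration only: TaggedDataBase, tag, and
modify are hypothetical names, and the tagid strings and field
values are invented for the example. Each distributed change
record carries the locus of the change and the old value of the
field.

```python
from collections import defaultdict


class TaggedDataBase:
    """Toy data base that distributes changes to whoever tagged a field."""

    def __init__(self):
        self.values = {}
        self.tags = defaultdict(set)       # (node, field) -> set of tagids
        self.contexts = defaultdict(list)  # tagid -> local change set

    def tag(self, node, field, tagid):
        """Ask to be told about future changes to one field of one node."""
        self.tags[(node, field)].add(tagid)

    def modify(self, node, field, value):
        """Modification primitive: update, then notify interested tagids."""
        old = self.values.get((node, field))
        self.values[(node, field)] = value
        for tagid in self.tags.get((node, field), ()):
            # Change record: locus of the change plus the old value.
            self.contexts[tagid].append((node, field, old))


db = TaggedDataBase()
db.tag("word-7", "phonemes", "fricative-detector#1")
db.modify("phrase-2", "words", ["word-7"])  # nobody tagged this field
db.modify("word-7", "phonemes", "F IH SH")  # the detector is notified
print(dict(db.contexts))
```

Only the process that tagged the changed field receives a change
record; the global data base itself keeps just the latest values.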

9 The information which defines the change consists of the locus
of the change (i.e., a node name and a field name) and the old value
of the field.

10 An alternative to replicating the change information for the
various KS processes is to maintain a single central copy of those
data, passing only references to the centralized items to the various
local contexts. The individual change items may then be deleted
from their central change sets whenever there are no further
outstanding references to that change.

For example, in its search through the global data base
for conditions satisfying the preconditions of some KS,
a precondition evaluator may accumulate references to
the data fields which it has examined; and when the entire
precondition has been found to be satisfied, the
precondition evaluator tags those fields (for which it has
accumulated references) with the name of the KS process it
is about to instantiate. Similarly, having been invoked, a
KS process might wish to do certain computations, but
only if certain data fields are not altered; the KS process
itself can tag these fields and thereby be informed of
subsequent tampering with the tagged fields. Subsets of
these tagged fields (the subsets being formed according
to the tagid’s chosen) form critical sets (specifying the
fields of the local context for the KS process) which are
locked (see below) every time the KS process wishes to
modify the global data base. Thus, after having locked the
critical set and prior to performing any update operations,
the KS process can check to see whether any other KS
process has made any changes which might invalidate the
current KS’s assumed context (and hence, perhaps,
invalidate its proposed update).11 For example, if a KS
process is verifying a hypothesis in the data base because
its rating exceeds 50 (i.e., the rating value represents part
of the local context for the KS process), then before
performing any modifications on the data base which
depend on the assumption, the KS process should check to
be sure that no other KS process has invalidated the
rating assumption in the meantime.

In addition to maintaining local copies of the various
relevant changes that have occurred since the
instantiation of a KS process, a KS process may also include
in its context various supersets of these changes. For
example, if a knowledge source is interested in the values of
changes in a node’s begin-time or end-time, it might be
worthwhile to include a superset which accumulates
indicators pointing to changes in either begin-time or
end-time (since checking such a superset for new elements
would be quicker than checking each constituent subset
individually). Note that supersets do not themselves
contain any data values, but rather are pointers to changes
contained in some one of its subsets. Further note that a KS
need not contain in its local context all (or any) of the
constituent subsets of a superset in order to have that
superset in its local context.
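
The superset idea can be sketched in Python as follows. This is
an assumption-laden illustration, not the HSII data structure:
ChangeSet and Superset are hypothetical names, and a Python
object reference stands in for the "indicator pointing to a
change".

```python
class ChangeSet:
    """Changes to one field name; may feed any number of supersets."""

    def __init__(self, field):
        self.field = field
        self.changes = []
        self.supersets = []

    def add(self, node, old_value):
        change = (node, self.field, old_value)
        self.changes.append(change)
        for superset in self.supersets:
            superset.indicators.append(change)  # a reference, not a copy


class Superset:
    """Holds only references to changes stored in its subsets."""

    def __init__(self, *subsets):
        self.indicators = []
        for subset in subsets:
            subset.supersets.append(self)


begin = ChangeSet("begin-time")
end = ChangeSet("end-time")
timing = Superset(begin, end)

end.add("seg-3", 120)  # old end-time of a hypothetical node "seg-3"

# One test on the superset replaces separate tests on each subset.
print(bool(timing.indicators))
```

A KS holding only the superset can detect that some timing field
changed and then follow the indicator to the change itself.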

Data Base Deadlock Prevention: Any KS process may
request exclusive access to some collection of fields at any
time. Thus, unless some care is taken in ordering the
requests for such exclusive access, the possibility of getting
into a deadlock situation exists (where, for example, one
KS process is waiting for exclusive access to a field already
held exclusively by a second KS process, and vice versa,
resulting in neither process being able to proceed).
Applying a linear ordering to the set of lockable fields and
requesting exclusive access according to that ordering is
a commonly used means of deadlock avoidance in
resource allocation. However, this technique works only if
all the resources (fields) to be locked are known ahead of
time. The ability to tag a data field allows a KS process to
locate and examine in arbitrary order the set of hypotheses
that will form the context for a data base modification and
then, only after the entire set of desired hypotheses (and
links) has been determined, to lock the desired set and
perform the modification.

11 The characterization of local contexts according to specific data
fields (which is made possible in part by the choice of levels in the
global data base) helps to minimize the overhead associated with
context maintenance. Also, the smaller the context, the more flexible
the scheduling strategy may be (since it needs to be less concerned
with the time required for a context swap on a processor).

Regarding the global data base as having orthogonal
dimensions of information level versus utterance time
further allows a convenient means of locking entire
regions of the data base without explicitly naming the nodes
or fields to be affected. This “blanket-locking” is called
region locking. For example, a KS process might request
the locking of a region consisting of the time interval from
50 to 80 centiseconds at the phonetic level. To assist in
determining such regions, each node in the global data base
contains a bit vector whose high-order bits describe the
time regions covered by the node and whose low-order bits
contain the integer level number of that node. These region
fields may be manipulated to allow locking of various
regions in the data base. Note that a region-lock may be used
to obtain exclusive access to a time region which need
not contain any nodes yet (i.e., one may region-lock a
“vacuum”).
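
One way such a region bit vector could be laid out is sketched
below in Python. The text does not give the field widths or the
time granularity, so LEVEL_BITS, SLOT_CS, the half-open
interval convention, and the level numbers used in the example
are all assumptions made for this illustration.

```python
LEVEL_BITS = 4  # assumed: room for up to 16 level numbers
SLOT_CS = 10    # assumed: one bit per 10-centisecond time slot


def region_vector(level, begin_cs, end_cs):
    """Pack covered time slots (high bits) and level number (low bits).

    The interval is treated as half-open: [begin_cs, end_cs).
    """
    slots = 0
    for slot in range(begin_cs // SLOT_CS, (end_cs - 1) // SLOT_CS + 1):
        slots |= 1 << slot
    return (slots << LEVEL_BITS) | level


def in_region(node_vector, lock_vector):
    """True if the node is at the locked level and overlaps its time span."""
    level_mask = (1 << LEVEL_BITS) - 1
    same_level = (node_vector & level_mask) == (lock_vector & level_mask)
    overlap = (node_vector >> LEVEL_BITS) & (lock_vector >> LEVEL_BITS)
    return same_level and overlap != 0


PHONETIC = 2  # assumed level number
lock = region_vector(PHONETIC, 50, 80)  # built without naming any node

print(in_region(region_vector(PHONETIC, 70, 90), lock))   # overlaps slot 70-80
print(in_region(region_vector(PHONETIC, 80, 100), lock))  # starts after region
print(in_region(region_vector(3, 50, 80), lock))          # different level
```

Because the lock vector is computed from the level and time
interval alone, a region can be locked before any node exists in
it, and a node created later can be tested against the lock with
two mask operations.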

To eliminate the possibilities of deadlock, the actual
locking operation is relegated to a system primitive, and a
KS process is required to present to the locking primitive
the entire set of fields to which it wants exclusive access.
This set is then extended to include all fields in the critical
set of the calling KS process, for the reasons relating to
context invalidation given above. The system locking
primitive can then request exclusive access for this union
of data fields, on behalf of the KS process, according to a
universal ordering scheme (such an ordering being possible
since every node in the global data base essentially has a
unique serial number) which assures that no deadlocks
occur. Having been granted exclusive access to all desired
fields at once, the KS process may then check to see
whether there have been any changes to the tagged data
fields. If there have been none, it can proceed to perform
its modifications (which modifications are sent to the local
contexts of others who care about such things). However,
if there were changes, the KS process can, after examining
the changed data fields, decide whether it still wants to
perform the modifications. When the KS process is finished
updating the data base, it releases all its locked fields by
executing a system primitive unlocking operator. In
particular, the system ensures that a KS process will not
request two lock operations without issuing an intervening
unlock operation. In this manner, any possibility of a data
base deadlock is eliminated.
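
The locking primitive described above can be sketched with
Python's standard threading module. This is a hedged
illustration rather than the HSII primitive: LockManager and its
methods are hypothetical names, and a (serial number, field name)
pair is assumed as the sortable key that supplies the universal
ordering.

```python
import threading


class LockManager:
    """Grants exclusive access to a whole set of fields in one request."""

    def __init__(self):
        self.guard = threading.Lock()  # protects the two tables below
        self.locks = {}                # field key -> threading.Lock
        self.held = {}                 # process name -> keys it has locked

    def lock(self, process, requested, critical_set):
        # Union of the request and the critical set, in universal order.
        keys = sorted(set(requested) | set(critical_set))
        with self.guard:
            if process in self.held:
                raise RuntimeError("unlock before requesting a second lock")
            self.held[process] = keys
            locks = [self.locks.setdefault(k, threading.Lock()) for k in keys]
        for field_lock in locks:       # acquire outside the guard, in order
            field_lock.acquire()
        return keys

    def unlock(self, process):
        with self.guard:
            keys = self.held.pop(process)
            locks = [self.locks[k] for k in reversed(keys)]
        for field_lock in locks:
            field_lock.release()


manager = LockManager()
log = []


def ks(name, fields):
    """Repeatedly lock the same fields, record the order used, unlock."""
    for _ in range(100):
        log.append(manager.lock(name, fields, critical_set=[]))
        manager.unlock(name)


# The two processes name the same fields in opposite orders.
a = threading.Thread(target=ks, args=("ks-a", [(1, "rating"), (2, "rating")]))
b = threading.Thread(target=ks, args=("ks-b", [(2, "rating"), (1, "rating")]))
a.start()
b.start()
a.join(timeout=10)
b.join(timeout=10)
print(len(log))
```

Both processes end up acquiring the fields in the same sorted
order regardless of how they listed them, which is what removes
the circular wait; the check on held enforces the rule that no
process issues two lock requests without an intervening unlock.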

Goal-Directed Scheduling: The computational sequence
of a typical HSII run may be characterized by considering
the chronological sequence of states of the utterance
recognition at any particular data base level. For example, if
one considers the efforts in producing the word level and
traces the development of the “best” partial sentences, the
processing that will have been done is analogous to a
general tree-search, where each node of the tree represents
some partially completed sentence (with the eventual
resultant sentence being one of the leaves of the tree). The
problem now is to guide this tree searching so as to find
the answer leaf in an optimal way (according to some
measure of optimality). To achieve this end, ratings are
associated with every hypothesis and link in the global data
base (and thus with every partial sentence node of the
analogous search tree). Using these ratings (which are
effectively evaluation functions indicating the goodness of
the work done so far with respect to a given partial
sentence), one may employ tree-searching strategies to
advance the search in some optimal manner.


To complicate matters further, however, HSII is
intended to be a multiprocess-oriented system. Therefore,
schemes must be devised for effectively searching a
problem-solving graph using parallel processes, since one can
conceive of pursuing several branches of the search graph
in parallel by asynchronously instantiating various KS
processes to evaluate various alternative paths. One must
also take into consideration the underlying hardware
architecture. The physical placement of the global data
base and the KS processes will have a very definite
influence on the scheduling philosophy chosen and the
resultant system efficiency.

To aid in making scheduling decisions, one may associate
with every node in the global data base some
attention-focusing factors, which are indicators telling how
much effort has been devoted to processing in this area of
the search tree and how desirable it is to devote further
effort to this section of the tree. Such attention-focusing
factors may also be associated with various speech time
regions to indicate interest in doing further processing on
certain regions of the utterance, regardless of any particular
information level. Furthermore, attention-focusing factors
are associated with every data base modification, thereby
distributing attention-focusing factors to the various
change sets which constitute the local contexts of the
processes in the system. The scheduler is one such process
which might be especially interested in such focusing
factors, as will be described below. The use of the various
ratings and attention-focusing factors allows HSII to
perform goal-directed scheduling, which is process
scheduling so as to achieve “optimal” recognition efficiency.
The thrust of goal-directed scheduling is that, while there
may be many processes ready to run and work on various
parts of the search tree, one should first schedule those
processes which can best help to achieve the goal of
utterance recognition. Notice that such search-tree pruning
techniques as the alpha-beta procedure (which is essentially
a sequential algorithm anyway) do not apply to HSII’s
nongame search trees, which do not have the constraint
that alternating levels of the tree represent the moves of
an opponent.

Goal-directed scheduling may be viewed as having two
separate functions: 1) using the ratings and
attention-focusing factors associated with the global data
base components to schedule KS processes which have been
invoked (readied for execution) in response to events
previously detected in the global data base, and 2) using
these same attention-focusing factors to detect important
areas in the global data base which require further work, and
invoking precondition evaluators as soon as possible to
instantiate new KS processes to work in those important
areas. Thus, the attention-focusing factors within the global
data base serve to schedule both KS processes and
precondition evaluators.

A KS process might be scheduled for execution because
it possesses the only processing capability available to be
applied to an important unexplored area of the data base.
However, if there are many such processes ready to
execute, the scheduler can perform a type of means-ends
analysis in which those KS processes are scheduled which
are likely to produce data base changes in which the
system is currently interested (such interest being noted
by high attention-focusing factors in a given change set).
For example, if the data base contains focusing factors
which highlight activity in a time region in which there are
no structural connections between two adjoining levels,
the scheduler would probably give a higher priority to a
KS process which will attempt (as indicated in its external
specifications) to make such a connection than to a KS
process which is likely merely to perform a minor refinement
on the ratings in one of the levels.
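
A scheduling policy of this kind can be sketched in Python with
the standard heapq module. The text does not say how ratings and
focusing factors are combined, so the product used in priority,
the expected_gain field (standing in for a KS's external
specifications), and all names and numbers below are assumptions
made for the illustration.

```python
import heapq


def priority(ks, focus):
    """Higher is better: expected contribution weighted by current interest."""
    interest = focus.get((ks["level"], ks["region"]), 0.0)
    return interest * ks["expected_gain"]


def schedule(ready, focus):
    """Return the names of the ready KS processes, most useful first."""
    # heapq is a min-heap, so priorities are negated; the index breaks ties.
    heap = [(-priority(ks, focus), index, ks["name"])
            for index, ks in enumerate(ready)]
    heapq.heapify(heap)
    order = []
    while heap:
        order.append(heapq.heappop(heap)[2])
    return order


# Attention-focusing factors per (level, time region); values are invented.
focus = {("word", (50, 80)): 0.9, ("phone", (0, 40)): 0.2}

ready = [
    {"name": "refine-ratings", "level": "word", "region": (50, 80),
     "expected_gain": 0.1},
    {"name": "link-levels", "level": "word", "region": (50, 80),
     "expected_gain": 0.8},
    {"name": "label-segments", "level": "phone", "region": (0, 40),
     "expected_gain": 0.9},
]

order = schedule(ready, focus)
print(order)
```

Here the process that would connect two levels in the highlighted
region is run first, and the minor rating refinement in the same
region is run last, even behind work in a region of low interest.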

Another means of effecting goal-directed scheduling
relates to the attention-focusing factors associated with
various time regions of the utterance (such focusing factors
reaching across all the information levels of the global
data base). Using these time-region focusing factors, one
can schedule KS processes which can contribute in a
particular time region, or invoke precondition evaluators to
instantiate some new KS processes to work within the
desired time region.

SUMMARY, CURRENT STATUS, AND PLANS

In summary, HSII is a system organization for speech
understanding that permits the representation of speech
knowledge in terms of a large number of diverse KS’s
which cooperate via a generalized (in terms of both data
and control) form of the hypothesize-and-test paradigm.
KS’s are independent and separable; they are activated
in a data-directed manner and execute asynchronously,
communicating information among themselves through a
global data base. This global data base, which is a
representation of the partial analysis of the utterance, is a
three-dimensional data structure (in which the dimensions
are level of representation, time, and alternatives) augmented
by AND/OR structural relationships which interconnect
elements of the data structure. This global data base
structure permits information generated by one KS to be:
1) retained for use by other KS’s, and 2) quickly
propagated to other relevant parts of the data base. In
addition to being a new representational framework for
specifying speech knowledge, HSII is a system organization
suitable for efficient implementation on a multiprocessor
computer system. In particular, the system organization
employs techniques which, while not violating the
independence and modularity of KS’s, permit 1) avoidance of
deadlock in the data base, 2) efficient implementation of
data-directed sequencing of knowledge sources, and 3)
goal-directed scheduling of asynchronously executable KS
processes.

A preliminary, synchronous version of HSII has been
operating on Carnegie-Mellon University’s PDP-10 since
January 1974. The fully asynchronous, multiprocess
version of HSII is now in the final stages of being
implemented, also on the PDP-10, and is expected to be
running by August 1974. This multiprocess version of HSII
will also contain the capability of simulating the effects of
operating HSII in a multiprocessor environment.
Experience with this multiprocess version of HSII, together
with simulation data on the effects of operating in a
multiprocessor environment, will form the basis for a
multiprocessor version of HSII on a multi-miniprocessor
computer (C.mmp). An initial implementation of HSII on
C.mmp is expected to be completed in the first quarter of
1975.

A Carnegie-Mellon University technical report
providing detailed examples of HSII’s recognition process is
planned for Autumn 1974.

APPENDIX  I 

HSI PERFORMANCE SUMMARY

Fig. 7 summarizes the performance of the HSI system
as of Autumn 1973. The results are based on tests of 144
sentences containing 676 words, spoken by five speakers
and using four different tasks (Chess, News Retrieval,
Medical Questionnaire, and Desk Calculator) whose
vocabularies range from 28 to 70 words. (More complete
descriptions of the tasks are given in [A] and [12].)

Column 1 gives the data set number, task, and speaker
identification. All of the speakers are male adults. Data
set 4 was recorded over a long-distance telephone
connection; the others were recorded in a fairly quiet
environment using one of several medium to high-quality
microphones. All of the utterances represent speech read
from a script (as opposed to being spontaneous).

Column  2 gives  the  number  of  words  in  the  task  lexicon. 

Column  3 shows  the  number  of  sentences  in  the  data 
set;  column  4 gives  the  total  number  of  word  tokens  in  the 
set. 

Column 5 contains the results of recognition with just
the Acoustics module (the acoustic-phonetic source of
knowledge). The task lexicon is the sole restriction; the
Syntax and Semantics modules are not used. The first
subcolumn indicates the percent of times the correct word
appears as the first choice of the Acoustics module in the
hypothesis list. The second subcolumn indicates the
percent of times the correct word is in the top two choices;
the third shows the top three. In this mode each incorrect
word recognition is overridden by specifying the correct
word boundary positions (using carefully hand-segmented
data) so that errors do not propagate.

Column 6 gives the results for the HSI system
recognition with the Acoustics module and the Syntax module
both operating. The first subcolumn indicates the percent
of sentences recognized completely correctly. The
“near-miss” (indicated below that number in the first
subcolumn) indicates the percent of times the recognized
utterance differed from the actual utterance by at most one
word of approximately similar phonetic structure. The second
subcolumn gives the percent of words recognized correctly.
The mean computation times on the PDP-10 computer
(in seconds per sentence and in seconds per second of
speech) are shown in subcolumns three and four.

Column 7 shows results for recognition using all three
sources of knowledge (for the Chess task only), the
Acoustics, Syntax and Semantics modules. The subcolumns are
similar to those of Column 6.

As one might expect, accuracy increases with the
introduction of more KS’s. It is also true that computation
time decreases with more KS’s; i.e., the additional KS’s
serve to reduce the size of the space that must be searched.

Performance differences across the different data sets
are attributable to the following several sources:

1) Task lexicon size.

2) Phonetic content of the lexicons: many rules in the
Acoustics module handle particular kinds of phonetic
juxtaposition. Vocabularies as small as the ones used here
do not exhaust all of these cases; each such vocabulary
exhibits some cases not found in each of the others.
Because a larger amount of effort has gone into the rules
based on the Chess lexicon, that task also performs best.

3) Speaker differences.

It is encouraging that the performance using the
telephone input (data set 4) does not appear to differ
significantly from the other data sets, which use higher
quality input. The labeling of acoustic segments is defined
by a table of values derived from training syllables uttered
by the speaker over the same channel used to input the test
data; thus it is somewhat self-normalizing. Also, the system
detects that there is very little higher frequency energy (in
the case of the telephone input); this automatically triggers
programmed readjustment of some thresholds and
heuristics dealing with high frequency information.

INTRODUCTION


Many AI problem-solving tasks require large amounts of processing power in
order to achieve solution in any given computer implementation of a problem-solving
strategy. The amount of processing power required is directly related to the size of
the search space which is examined during the course of problem solution. Exhaustive
search of the state space associated with almost any problem of interest is precluded
due to the sheer size of the state space.* In most problem-solving attempts, heuristics
are employed which prune the search space to a more manageable size. However,
searching even the reduced state space often requires large amounts of processing
power. The demand for sufficient computing power becomes critical in tasks requiring
real-time solution, as is the case in the speech-understanding task with which this paper
is primarily concerned. For example, a speech-understanding system capable of reliably
understanding connected speech involving a large vocabulary and spoken by multiple
speakers is likely to require from 10 to 100 million instructions per second of computing
power, if the recognition is to be performed in real time.^ Recent trends in technology
suggest that this computing power can be economically obtained through a closely-
coupled network of asynchronous "simple" processors (involving perhaps 10 to 100 of
these processors) (Bell, et al., 1973, and Heart, et al., 1973). The major problem (from
the problem-solving point of view) with this network multiprocessor approach for
generating computing power is in devising the various problem-solving algorithms in
such a way as to exhibit a structure appropriate for exploiting the parallelism available
in the multiprocessor network, for it is only by taking advantage of this processing
parallelism that the desired effective computing power will be achieved.

The  Hearsay  II  speech-understanding  system  (HSII)  (Lesser,  et  aL  1 974; 
Fennell,  1975;  and  Erman  and  Lesser,  1975)  currently  under  development  at  Carnegie- 
Mellon  University  represents  a problem-solving  organization  that  can  effectively  exploit 
a multiprocessor  system.  HSII  has  been  designed  as  an  Al  system  organization  suitable 
for  expressing  knowledge-based,  problem-solving  strategies  in  which  appropriately 

* As  an  example,  consider  the  chess-playing  task.  In  an  end  game  situation,  there  are 
typically  20  legal  moves  at  each  ply  (half-move);  so  for  a search  depth  of  6 plies,  the 
search  space  will  have  64  million  branches. 

^ The Hearsay I (Reddy, et al., 1973a,b,c and Erman, 1974) and Dragon (Baker, 1975)
speech understanding systems require approximately 10 to 20 mips of computing
power for real-time recognition when handling small vocabularies.

organized subject-matter knowledge may be represented as knowledge sources capable
of contributing their knowledge in a parallel data-directed fashion. A knowledge source
may  be  described  as  an  agent  that  embodies  the  knowledge  of  a particular  aspect  of  a 
problem  domain  and  is  useful  in  solving  a problem  from  that  domain  by  performing 
actions  based  upon  its  knowledge  so  as  to  further  the  progress  of  the  overall  solution. 
It  is  felt  that  the  knowledge  source  is  an  appropriate  unit  for  use  in  the  decomposition 
of  a knowledge-intensive  task  domain.  Knowledge  sources,  being  suitably  organized 
capsules  of  subject-matter  knowledge,  may  be  independently  formulated  as  various 
pieces of the knowledge relevant to a task domain become crystallized. The HSII
system  organization  allows  these  various  independent  and  diverse  sources  of  knowledge 
to  be  specified  and  their  interactions  coordinated  so  they  might  cooperate  with  one 
another  (perhaps  asynchronously  and  in  parallel)  to  effect  a problem  solution.  As  an 
example  of  the  decomposition  of  a task  domain  into  knowledge  sources,  in  the  speech 
task  domain  there  might  be  distinct  knowledge  sources  to  deal  with  acoustic,  phonetic, 
lexical,  syntactic,  and  semantic  information.  While  the  speech  task  is  the  first  test  of  the 
multiprocessing problem-solving organization of HSII, it is believed that the system
organization provided by HSII is capable of expressing other knowledge-based AI
problem-solving  strategies,  as  might  be  found  in  vision,  robotics,  chess,  natural  language 
understanding,  and  protocol  analysis.  In  fact,  proposals  are  under  way  which  will 
further  test  the  applicability  of  HSII  by  implementing  a system  for  the  analysis  of 
natural scenes using the HSII problem-solving organization (Ohlander, 1975).

The  rest  of  this  paper  will  explore  several  of  the  ramifications  of  such  a 
problem-solving  organization  by  examining  the  mechanisms  and  policies  underlying  HSII 
which are necessary for supporting its organization as a multiprocessing problem-solving
system. First, an abstract description of a class of problem-solving systems is
given using the Production System model of Newell (1973). Then, the HSII problem-solving
organization is described in terms of this model. The various decisions made
during  the  course  of  design  necessitated  the  introduction  of  various  multiprocessing 
mechanisms  (e.g.,  mechanisms  for  maintaining  data  localization  and  data  integrity),  and 
these  mechanisms  are  discussed.  Finally,  a simulation  study  is  presented  which  details 
the  effects  of  actually  implementing  such  a problem-solving  organization  in  a 
multiprocessor environment.

THE MODEL


An  Abstract  Model  for  Problem  Solving 

In the abstract, the problem-solving organization underlying HSII may be
modeled in terms of a "production system" (Newell, 1973). A production system is a
scheme for specifying an information processing system in which the control structure
of the system is defined by operations on a set of productions of the form 'P -> A',
which operate from and on a collection of data structures. 'P' represents a logical
antecedent, called a precondition, which may or may not be satisfied by the information
encoded within the dynamically current set of data structures. If 'P' is found to be
satisfied by some data structure, then the associated action 'A' may be executed, which
presumably  will  have  some  altering  effect  upon  the  data  base  such  that  some  other  (or 
the  same)  precondition  becomes  satisfied.  This  paradigm  for  sequencing  of  the  actions 
can  be  thought  of  as  a data-directed  control  structure,  since  the  satisfaction  of  the 
precondition  is  dependent  upon  the  dynamic  state  of  the  data  structure.  Productions 
are  executed  as  long  as  their  antecedent  preconditions  are  satisfied,  and  the  process 
halts  either  when  no  precondition  is  found  to  be  satisfied  or  when  an  action  executes  a 
stop operation (thereby signalling problem solution or failure, in the case of
problem-solving systems).
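
The control cycle just described can be sketched in a few lines of Python. This is only an illustrative, strictly sequential interpreter, not the HSII implementation: the names run, Stop, increment, and stop_at_three are hypothetical, and the conflict-resolution rule (fire the first satisfied production) is an assumption made to keep the sketch short.

```python
from typing import Callable, List, Tuple

# A production is a (precondition, action) pair operating on shared data.
Production = Tuple[Callable[[dict], bool], Callable[[dict], None]]


class Stop(Exception):
    """Raised by an action to signal problem solution or failure."""


def run(productions: List[Production], data: dict, max_cycles: int = 1000) -> int:
    """Fire satisfied productions until none is satisfied or an action stops.

    Returns the number of actions executed.
    """
    fired = 0
    for _ in range(max_cycles):
        # Data-directed control: the current data decides what may run next.
        ready = [action for pre, action in productions if pre(data)]
        if not ready:
            break                  # halt: no precondition is satisfied
        fired += 1
        try:
            ready[0](data)         # assumption: fire the first satisfied production
        except Stop:
            break                  # halt: an action executed a stop operation
    return fired


def increment(data: dict) -> None:
    data["n"] += 1


def stop_at_three(data: dict) -> None:
    data["done"] = True
    raise Stop


demo_productions: List[Production] = [
    (lambda d: d["n"] >= 3, stop_at_three),
    (lambda d: d["n"] < 3, increment),
]
demo_data = {"n": 0, "done": False}
demo_fired = run(demo_productions, demo_data)
```

Here three increments change the data until the stopping production becomes satisfied; the order in which productions are listed plays no role, since only the state of the data selects what runs.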


The  HSII  Problem-Solving  Organization:  A Production  System  Approach 

The  HSII  system  organization,  which  can  be  characterized  as  a "parallel" 
production  system,  has  a centralized  data  base  which  represents  the  dynamic  problem 
solution  state.  This  data  base,  which  is  known  as  the  blackboard,  is  a multidimensional 
data structure which is readable and writable by any precondition or knowledge-source
process  (where  a knowledge-source  process  is  the  embodiment  of  a production 
action).1  Preconditions  are  procedurally  oriented  and  may  specify  arbitrarily  complex 
tests  to  be  performed  on  the  data  structure  in  order  to  decide  precondition  satisfaction. 

1 As  an  example,  the  dimensions  of  the  HSII  speech-understanding  system  data  base 
are  informational  level  (e.g.,  acoustic  level,  phonetic  level,  and  word  level),  utterance 
time  (speech  time  measured  from  the  beginning  of  the  input  utterance),  and  data 
alternatives (where multiple hypotheses are permitted to exist simultaneously at the
same level and utterance time). For additional details, see Appendix A.

Preconditions are themselves data-directed in that they are tested for satisfaction
whenever  relevant  changes  occur  in  the  data  base;*  and  simultaneous  precondition 
satisfaction  is  permitted.  Testing  for  precondition  satisfaction  is  not  presumed  to  be  an 
instantaneous or even an indivisible operation, and several such precondition tests may
proceed  concurrently. 

The  knowledge-source  processes  representing  the  production  actions  are  also 
procedurally  oriented  and  may  specify  arbitrarily  complex  sequences  of  operations  to 
be performed upon the data structure. The overall effect of any given knowledge-source
process is usually either to hypothesize new data which is to be added to the
data base or to verify (and perhaps modify) data previously placed in the data base.
This follows the general hypothesize-and-test problem-solving paradigm wherein
hypotheses representing partial problem solutions are generated and then tested for
validity; this cycle continues until the verification phase certifies the completion of
processing (and either the problem is solved or failure is indicated). The execution of a
knowledge-source  process  is  usually  temporally  disjoint  from  the  satisfaction  of  its 
precondition;  the  execution  of  any  given  knowledge-source  process  is  not  presumed  to 
be  indivisible;  and  the  concurrent  execution  of  multiple  knowledge-source  processes  is 
permitted.  In  addition,  a precondition  process  may  invoke  multiple  instantiations  of  a 
knowledge  source  to  work  on  the  different  parts  of  the  blackboard  which  independently 
satisfy  the  precondition’s  pattern.  Thus,  the  independent  data-directed  nature  of 
precondition  evaluation  and  knowledge-source  execution  can  potentially  generate  a 
significant  amount  of  parallel  activity  through  the  concurrent  execution  of  different 
preconditions,  different  knowledge  sources,  and  multiple  instantiations  of  a single 
knowledge  source. 

* In  effect,  preconditions  themselves  have  preconditions,  call  them  "pre-preconditions." 
In  HSII,  knowledge-source  preconditions  (which  correspond  to  action  preconditions  in 
the  production  system  model)  may  be  arbitrarily  complex.  In  order  to  avoid  executing 
these  precondition  tests  unnecessarily  often,  they  in  turn  have  pre-preconditions 
which  are  essentially  monitors  on  relevant  primitive  data  base  events  (e.g.,  monitoring 
for  a change  to  a given  field  of  a given  node  in  the  data  base,  or  a given  field  of  any 
node  in  the  data  base).  Whenever  any  of  these  primitive  events  occurs,  those 
preconditions  monitoring  such  events  are  awakened  and  allowed  to  test  for  full 
precondition  satisfaction.  These  data  events  are  used  by  the  precondition  process  as 
pointers  to  the  specific  parts  of  the  data  base  which  may  now  satisfy  the  pattern  the 
precondition  is  monitoring  for.  During  the  period  between  when  the  precondition 
process  has  been  first  awakened  and  the  time  it  is  executed,  the  monitoring  for 
relevant  data  base  events  continues.  Thus,  a precondition  process,  when  finally 
executed,  may  check  more  than  one  part  of  the  data  base  for  satisfaction. 




The basic structure and components of the HSII organization may be depicted
as shown in the message transaction diagram of Figure 1. The diagram indicates the
paths of active information flow between the various components of the problem-solving
system as solid arrows; paths indicating control activity are shown as broken
arrows. The major components of the diagram include a passive global data structure
(the blackboard) which contains the current state of the problem solution. Access to the
blackboard is conceptually centralized in the blackboard handler module,1 whose
primary function is to accept and honor requests from the active processing elements to
read and write parts of the blackboard. The active processing elements which pose
these data access requests consist of knowledge-source processes and their associated
preconditions. Preconditions are activated by a blackboard monitoring mechanism
which monitors the various write-actions of the blackboard handler; whenever an event
occurs which is of interest to a particular precondition process, that precondition is
activated. If upon further examination of the blackboard, the precondition finds itself
"satisfied," the precondition may then request a process instantiation of its associated
knowledge source to be established, passing the details of how the precondition was
satisfied as parameters to this instantiation of the knowledge source. Once instantiated,
the knowledge-source process can respond to the blackboard data condition which was
detected by its precondition, possibly requesting further modifications be made to the
blackboard, perhaps thereby triggering further preconditions to respond to the latest
modifications. This particular characterization of the HSII organization, while certainly
overly simplified, shows the data-driven nature of the knowledge source activations and
interactions.
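
The flow of activity in this simplified organization can be illustrated with a small Python sketch. It is a serial toy under stated assumptions, not the HSII code: the class and function names (Blackboard, Precondition, run_until_quiet, and the two demonstration knowledge sources) are hypothetical, monitors are keyed on field names only, and awakened preconditions are simply processed one at a time rather than concurrently.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class Precondition:
    """Pairs a full satisfaction test with the knowledge source it guards."""

    def __init__(self, test, knowledge_source):
        self.test = test
        self.knowledge_source = knowledge_source


class Blackboard:
    """Passive global data: nodes with named fields, plus write monitors."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Dict[str, object]] = {}
        # field name -> preconditions to awaken when that field is written
        self.monitors: Dict[str, List[Precondition]] = defaultdict(list)
        self.awakened: List[Tuple[Precondition, str]] = []

    def read(self, node: str, field: str) -> object:
        return self.nodes[node][field]

    def write(self, node: str, field: str, value: object) -> None:
        self.nodes.setdefault(node, {})[field] = value
        # Every write is an event offered to the interested preconditions.
        for pre in self.monitors[field]:
            self.awakened.append((pre, node))


def run_until_quiet(bb: Blackboard, max_steps: int = 100) -> List[str]:
    """Serially drain awakened preconditions; return a trace of KS runs."""
    trace = []
    for _ in range(max_steps):
        if not bb.awakened:
            break
        pre, node = bb.awakened.pop(0)
        if pre.test(bb, node):
            # Instantiate the knowledge source, passing along the node
            # on which its precondition was satisfied.
            trace.append(pre.knowledge_source(bb, node))
    return trace


def has_vowel_label(bb, node):
    return bb.read(node, "label") == "vowel-like"


def hypothesize_phone(bb, node):
    bb.write(node + "/phone", "phone", "AH")   # triggers the next precondition
    return "phone-hypothesizer"


def has_phone(bb, node):
    return bb.read(node, "phone") is not None


def rate_phone(bb, node):
    bb.write(node, "rating", 0.5)              # no precondition monitors "rating"
    return "rater"


demo_bb = Blackboard()
demo_bb.monitors["label"].append(Precondition(has_vowel_label, hypothesize_phone))
demo_bb.monitors["phone"].append(Precondition(has_phone, rate_phone))
demo_bb.write("seg1", "label", "vowel-like")
demo_trace = run_until_quiet(demo_bb)
```

The two knowledge sources never refer to one another; the second runs only because the first wrote a field that the second's precondition was monitoring.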


The  following  sections  of  this  paper  will  attempt  to  refine  this  diagram  of  the 
HSII  organization  by  pointing  out  the  difficulties  that  arise  from  this  oversimplified 
representation  of  the  organization  and  by  supplementing  the  various  components  of  this 
simple  diagram  to  resolve  these  problems  and  result  in  a more  complete  organization  for 
AI  problem-solving  in  multiprocessing  environments.  A more  complete  message 
transaction  diagram  for  HSII  will  be  presented  in  a subsequent  section. 


1 The blackboard handler module could be implemented either as a procedure which is
called as a subroutine from precondition and knowledge source processes, or as a
process which contains a queue of requests for blackboard access and modification
sent by precondition and knowledge source processes. In the implementation
discussed in this paper, the blackboard handler module is implemented as a
subroutine.

Figure 1. Simplified HSII System Organization


HEARSAY  II  MULTIPROCESSING  MECHANISMS 

Given  the  decision  that  multiple  preconditions  may  be  simultaneously  satisfied 
and  that  multiple  knowledge-source  processes  may  execute  concurrently,  various 
mechanisms  must  be  provided  to  accommodate  such  a multiprocessing  environment. 
Mechanisms must be provided to support the individual localized executions of the
various active and ready processes and to keep the processes from interfering with one
another, either directly or indirectly. On the other hand, mechanisms must also be
provided so that the various active processes may communicate with one another so as
to achieve the desired process cooperation. Since the various constituent knowledge
sources  are  assumed  to  be  independently  developed  and  are  not  to  presume  the  explicit 
existence  of  other  knowledge  sources,  communication  among  these  knowledge  sources 
must  necessarily  be  indirect.  The  desire  for  a modular  knowledge  source  structure 
arises  from  the  fact  that  usually  many  different  people  are  involved  in  the 
implementation  of  the  set  of  knowledge  sources,  and,  for  purposes  of  experimentation 
and  knowledge  source  performance  analysis,  the  system  should  be  able  to  be  easily 
reconfigured  with  alternative  subsets  of  knowledge  sources.  This  communication  takes 
two  primary  forms:  data  base  monitoring  for  collecting  pertinent  data  event  information 
for future use (local contexts and precondition activation), and data base monitoring for
the  occurrence  of  data  events  which  violate  prior  data  assumptions  (tags  and  messages). 
The  following  paragraphs  will  discuss  these  forms  of  data  base  monitoring  and  their 
relationship  to  the  data  access  synchronization  mechanisms  required  in  a multiprocess 
system  organization. 


Local  Contexts 

Interprocess  communication  (and  interference)  among  knowledge  sources  and 
their  associated  preconditions  occurs  mainly  via  the  global  data  base,  as  a result  of  the 
design  decisions  involved  in  trying  to  maintain  process  independence.  It  is  therefore 
not  surprising  that  the  mechanisms  necessary  to  bring  about  the  desired  process 
cooperation  and  independence  are  based  on  global  data  base  considerations.  The  global 
data  base  (the  blackboard)  is  intended  to  contain  only  dynamically  current  information. 
Since  preconditions  (being  data-directed)  are  to  be  tested  for  satisfaction  upon  the 
occurrence  of  relevant  data  base  changes  (which  are  historical  data  events),  and  since 
neither precondition testing nor action execution (nor the sequential combination of the
two) is assumed to be an indivisible operation, localized data bases must be provided
for each process unit (precondition or action) which needs to remember relevant
historical data events. These localized data bases, called local contexts in HSII, which
record the changes to the blackboard since the precondition process was last executed
or the knowledge-source process was created, provide personalized operating
environments for the various precondition and knowledge-source processes. A local
context preserves only those data events1 and state changes relevant to its owner.
The creation time of the local context (i.e., the time from which it begins collecting data
events) is also dependent upon the context owner. Any given local context is built up
incrementally: when a modification occurs to the global data base, the resulting data
event is distributed to the various local contexts interested in such events. The various
primitive data modification routines (or node creation routines) are responsible for the
distribution of the data events which result from the modification, just as these
modification routines are also responsible for sending warning messages to those
processes which want to be notified when specific characteristics of a particular node
are altered.2 Thus, the various local contexts retain a history of relevant data events,
while the global data base contains only the most current information.
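
A minimal Python sketch of this division of labor follows. It assumes a flat data base keyed by (node, field) pairs and an interest test supplied by each context owner; the names DataEvent, LocalContext, GlobalDataBase, and modify are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class DataEvent:
    node: str          # locus of the event: the node name ...
    field_name: str    # ... and the field name within that node
    old_value: object  # the new value is kept in the global data base


@dataclass
class LocalContext:
    interested_in: Callable[[str, str], bool]
    events: List[DataEvent] = field(default_factory=list)


class GlobalDataBase:
    """Holds only current values; history goes to the local contexts."""

    def __init__(self) -> None:
        self.fields: Dict[Tuple[str, str], object] = {}
        self.contexts: List[LocalContext] = []

    def modify(self, node: str, field_name: str, value: object) -> None:
        """Primitive modification routine: store, then distribute the event."""
        key = (node, field_name)
        event = DataEvent(node, field_name, self.fields.get(key))
        self.fields[key] = value
        for ctx in self.contexts:
            if ctx.interested_in(node, field_name):
                ctx.events.append(event)


demo_db = GlobalDataBase()
rating_watcher = LocalContext(lambda node, name: name == "rating")
demo_db.contexts.append(rating_watcher)
demo_db.modify("word7", "rating", 40)
demo_db.modify("word7", "begin_time", 12)   # not of interest to rating_watcher
demo_db.modify("word7", "rating", 65)
```

After these three modifications the global data base holds only the latest rating, while the watcher's local context holds the two earlier values in the order in which they were overwritten.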


Data  Integrity 

Since  precondition  and  knowledge-source  processes  are  not  guaranteed  to  be 
executed uninterruptedly, these processes often need to assure the integrity of various
assumptions they are making about the contents of the data base; for should these
assumptions become violated due to the actions of an intervening process, the further
computation of the assuming process may have to be altered (or terminated). One way
to approach the problem of data integrity is to guarantee the validity of data
assumptions by disallowing intervening processes the ability to modify (or perhaps even
to examine) critical data. In order to guarantee the integrity of data through the
mechanism of exclusive access, the HSII system provides two forms of locking primitives,
node- and region-locking. Node-locking guarantees exclusive access to an explicitly

1 The  information  which  defines  a data  event  consists  of  the  locus  of  the  event  (i.e.,  a 
data  node  name  and  a field  name  within  that  node)  and  the  old  value  of  the  field  (the 
new  value  being  stored  in  the  global  data  base). 

2 The use of these warning messages as a way of preserving data integrity will be
discussed in the next section.

specified node in the blackboard, whereas region-locking guarantees exclusive access to
a collection of nodes that are specified implicitly based on a set of node characteristics.
In the current implementation of HSII, the region characteristics are specified by a
particular information level and time period of a node. If the blackboard is considered
as  a two  dimensional  structure  with  coordinates  of  information-level  and  time,  then 
region-locking  permits  the  locking  of  an  arbitrary  rectangular  area  in  the  blackboard. 
Region-locking  has  the  additional  property  of  preventing  the  creation  of  any  new  node 
that  would  be  placed  in  the  blackboard  area  specified  by  the  region  by  other  than  the 
process  which  had  requested  the  region-lock.  Additional  locking  flexibility  is  introduced 
by  allowing  processes  to  request  read-only  access  to  data  fields;  this  reduces  possible 
contention  by  permitting  multiple  readers  of  a given  field  to  coexist,  while  excluding  any 
writers  of  that  field  until  all  readers  are  finished.  The  system  also  provides  a "super 
lock,"  which  allows  an  arbitrary  group  of  nodes  and  regions  to  be  locked  at  the  same 
time. A predefined linear ordering strategy for non-preemptive data access allocation
(Coffman,  et  al.,  1971)  is  applied  by  the  "super  lock"  primitive  to  the  desired  node-  and 
region-locks  so  as  to  avoid  the  possibility  of  data  base  deadlock. 
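
The deadlock-avoidance idea behind the "super lock" can be sketched in Python as follows. This is an illustration under assumptions, not the HSII primitive: nodes and regions are reduced to opaque string names, the linear order is simply the sorted order of those names, and read-only (shared) access is omitted. The names LockTable, super_lock, and worker are hypothetical.

```python
import threading
from contextlib import contextmanager
from typing import Dict, Iterable, Iterator, List


class LockTable:
    """One exclusive lock per named resource (a node or a region key)."""

    def __init__(self) -> None:
        self._locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()

    def _lock_for(self, name: str) -> threading.Lock:
        with self._guard:
            return self._locks.setdefault(name, threading.Lock())

    @contextmanager
    def super_lock(self, names: Iterable[str]) -> Iterator[List[str]]:
        # Fixed linear order: every process acquires in sorted-name order,
        # so no cycle of processes waiting on one another can form.
        ordered = sorted(set(names))
        held = []
        try:
            for name in ordered:
                lock = self._lock_for(name)
                lock.acquire()
                held.append(lock)
            yield ordered
        finally:
            for lock in reversed(held):
                lock.release()


demo_table = LockTable()
demo_completed: List[str] = []


def worker(tag: str, wanted: List[str]) -> None:
    for _ in range(200):
        with demo_table.super_lock(wanted):
            pass
    demo_completed.append(tag)


# The two workers name the same resources in opposite orders.
demo_threads = [
    threading.Thread(target=worker, args=("A", ["node:17", "region:phone:0-40"])),
    threading.Thread(target=worker, args=("B", ["region:phone:0-40", "node:17"])),
]
for t in demo_threads:
    t.start()
for t in demo_threads:
    t.join(timeout=10)
```

Because both workers end up acquiring the locks in the same order regardless of how they listed them, neither can hold one lock while waiting for a lock the other holds.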

However,  this  technique  of  guaranteeing  data  integrity  through  exclusive 
access  is  only  applicable  if  all  the  nodes  and  regions  to  be  accessed  and  modified  are 
known ahead of time. The sequential acquisition of exclusive access to nodes and
regions, without intervening unlocks, can result in the possibility of deadlock. In the HSII
blackboard, nodes are interconnected to form a directed graph structure; because it is
possible to establish an arbitrarily complex interconnection structure, it is often very
difficult for a knowledge-source process to anticipate the sequence of nodes it will
desire  to  access  or  modify.  Thus,  the  mechanisms  of  exclusive  access  cannot  always  be 
used  to  guarantee  data  integrity  in  a system  with  a complex  data  structure  and  a set  of 
unknown  processes.  Further,  even  if  the  knowledge  source  can  anticipate  the  area  in 
the  blackboard  within  which  it  will  work  and  thereby  request  exclusive  access  to  this 
area,  the  area  may  be  very  large,  thus  leading  to  a significant  decrease  in  potential 
parallel  activity  caused  by  other  processes  waiting  for  this  locked  area  to  become 
available. 

An  alternative  approach  to  guaranteeing  data  integrity  is  to  provide  a means 
by  which  a process  (precondition  or  knowledge  source)  may  place  data  assumptions 
about  the  particular  state  of  a node  or  group  of  nodes  in  the  data  base  (the  action  of 
putting these assumptions in the blackboard is called tagging). If these assumptions are
invalidated by a subsequent blackboard modification operation of another process, then
a message indicating this violation is sent to the process making the assumption. In the
meantime, the assuming process can proceed without obstructing other processes, until
such time as it intends to modify the data base (since data base modification is the only
way one process can affect the execution of another). The process must then acquire
exclusive access to the parts of the data base involved in its prior assumptions (which
parts will have been previously tagged in the data base to define a critical data set)^
and  check  to  see  whether  the  assumptions  have  been  violated  (in  which  case,  messages 
indicating  those  violations  would  have  been  sent  to  the  process).  If  a violation  has 
occurred,  the  assuming  process  may  wish  to  take  alternative  action;  otherwise,  the 
intended  data  base  modifications  may  be  made  as  if  the  process  had  had  exclusive 
access  throughout  its  computation.  This  tagging  mechanism  can  also  be  used  to  signal 
the  knowledge-source  process  that  the  initial  conditions  in  the  blackboard  (i.e.,  the 
precondition  pattern)  that  caused  the  precondition  to  invoke  it  have  been  modified;  this 
is  accomplished  by  having  the  precondition  tag  these  initial  conditions  on  behalf  of  the 
knowledge-source  process  prior  to  the  instantiation  of  the  knowledge  source. 
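
The tagging discipline resembles what would now be called optimistic concurrency control, and can be sketched in Python as below. The sketch is single-threaded and makes several simplifying assumptions: fields are keyed by (node, field) pairs, messages are kept in a per-process mailbox, and the step of acquiring exclusive access before the final check is only noted in a comment. The names TaggedBlackboard, tag, write, and commit are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (node name, field name)


class TaggedBlackboard:
    def __init__(self) -> None:
        self.fields: Dict[Key, object] = {}
        self.tags: Dict[Key, List[str]] = defaultdict(list)     # key -> taggers
        self.mailbox: Dict[str, List[Key]] = defaultdict(list)  # process -> violations

    def tag(self, process: str, key: Key) -> object:
        """Record that `process` assumes the current value of `key`."""
        self.tags[key].append(process)
        return self.fields.get(key)

    def write(self, writer: str, key: Key, value: object) -> None:
        self.fields[key] = value
        for process in self.tags[key]:
            if process != writer:
                self.mailbox[process].append(key)   # invalidation message

    def commit(self, process: str, updates: Dict[Key, object]) -> bool:
        """Apply updates only if no assumption of `process` was violated.

        A real system would first acquire exclusive access to the tagged
        fields; this single-threaded sketch omits that step.
        """
        violated = bool(self.mailbox[process])
        if not violated:
            for key, value in updates.items():
                self.write(process, key, value)
        # Release this process's tags and discard its messages.
        for key in list(self.tags):
            self.tags[key] = [p for p in self.tags[key] if p != process]
        self.mailbox[process].clear()
        return not violated


demo_tb = TaggedBlackboard()
rating = ("word7", "rating")
demo_tb.write("setup", rating, 40)
demo_tb.tag("ks_a", rating)
demo_tb.tag("ks_b", rating)
demo_tb.write("ks_c", rating, 10)            # violates both assumptions
a_ok = demo_tb.commit("ks_a", {("word7", "validated"): True})
demo_tb.tag("ks_d", rating)                  # tagged after the change
d_ok = demo_tb.commit("ks_d", {("word7", "validated"): True})
```

Process ks_a learns at commit time that its assumption was overturned and makes no modification, whereas ks_d, whose assumption was never violated, proceeds as if it had held exclusive access throughout. Neither process blocked ks_c from writing in the meantime.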

In summary, the HSII organization provides mechanisms to accomplish both of
these  forms  of  data  integrity  assurance:  the  various  data  base  locking  mechanisms 
described  previously  provide  several  forms  of  exclusive  or  read-only  data  access;  and 
the  data  tagging  facility  allows  data  assumptions  to  be  placed  in  the  data  base  without 
interfering  with  any  process’  ability  to  access  or  modify  that  area  of  the  data  base  (with 
data  invalidation  warning  messages  being  sent  by  data  base  monitors  whenever  the 
assumptions  are  violated). 

To  provide  a basis  for  the  discussion  in  the  subsequent  sections  of  this  paper, 
Figure  2,  depicting  the  various  components  of  the  HSII  organizational  structure,  is 
offered.  The  diagram  is  a more  detailed  version  of  the  message  transaction  model 
presented  previously.  The  new  components  of  this  diagram  are  primarily  a result  of 
addressing  multiprocessing  considerations. 

As  in  the  earlier,  more  simplified  organizational  diagram,  the  dynamically 
current  state  of  the  problem  solution  is  contained  in  a centralized,  shared  data  base, 
called  the  blackboard.  The  blackboard  not  only  contains  data  nodes,  but  it  also  records 

^ Actually, the requirement is that no other process be able to write to these parts of
the data base.

data monitoring information (tags) and data access synchronization information (locks).
Access  to  the  blackboard  is  conceptually  centralized  in  three  modules.  As  before,  the 
blackboard  handler  module  accepts  and  honors  read  and  write  data-access  requests 
from  the  active  processing  elements  (the  knowledge-source  processes  and  their 
precondition processes). A lock handler coordinates data-access synchronization
requests  from  the  knowledge-source  processes  and  preconditions,  with  the  ability  to 
block  the  progress  of  the  requesting  process  until  the  synchronization  request  may  be 
satisfied.  A monitoring  mechanism  is  responsible  for  accepting  data  tagging  requests 
from  the  knowledge-source  processes  and  preconditions,  and  for  sending  messages  to 
the  tagging  processes  whenever  a tagged  data  field  is  modified.  It  is  also  the 
responsibility  of  the  monitoring  mechanism  to  distribute  data  events  to  the  various  local 
contexts  of  the  knowledge-source  processes  and  preconditions,  as  well  as  to  activate 
precondition  processes  whenever  sufficient  data  events  of  interest  to  those 
preconditions have occurred in the blackboard.

Associated  with  each  active  processing  element  is  a local  data  base,  the  local 
context, which records data events that have occurred in the blackboard and are of
interest to that particular process. The local contexts may be read by their associated
processes  in  order  to  find  out  which  data  nodes  have  been  modified  recently  and  what 
the  previous  values  of  particular  data  fields  were.  The  local  contexts  are  automatically 
maintained  by  the  blackboard  monitoring  mechanism. 

Upon being activated and satisfied, precondition processes may invoke a
knowledge source (thereby creating a knowledge-source process), passing along the
reasons for this instantiation as parameters to the new knowledge-source process and
at  the  same  time  establishing  the  appropriate  data  monitoring  connections  necessary  for 
the  new  process.  The  goal-directed  scheduler  retains  the  actual  control  over  allocating 
hardware  processing  capability  to  those  knowledge-source  processes  and  precondition 
processes  which  can  best  serve  to  promote  the  progress  of  the  problem  solution.* 


* One  way  a scheduler  might  help  in  reducing  (or  eliminating)  global  data  base  access 
interference  is  to  schedule  to  run  concurrently  only  processes  whose  global  data 
demands  are  disjoint.  Such  a scheduling  policy  could  even  be  used  to  supplant  an 
explicit  locking  scheme,  since  the  global  data  base  locking  would  be  effectively 
handled  by  the  scheduler  (albeit  probably  on  a fairly  gross  level).  Of  course,  other 
factors may rule out such an approach to data access synchronization, such as an
inability to make maximal use of the available processing resources if only data-disjoint
processes are permitted to run concurrently, or the inability to know in

EXPERIMENTS WITH AN IMPLEMENTATION


The preceding sections of this paper have presented various of the
mechanisms necessary in implementing a knowledge-based problem-solving system such
as HSII in a multiprocessing environment. The present sections will discuss the various
experiments that have been performed in an attempt to characterize the multiprocessing
performance of the HSII organization in the speech-understanding task.


HSII  Multiprocess  Performance  Analysis  through  Simulation 

In  order  to  gain  insight  into  the  various  efficiency  issues  involving 
multiprocess problem-solving organizations, a simulation model was incorporated within
the uniprocessor version of the HSII speech-understanding system. The HSII problem-solving
organization was not itself modeled and simulated, but rather the actual HSII
implementation  (which  is  a multiprocessing  organization  even  when  executing  on  a 
uniprocessor)  was  modified  to  permit  the  simulation  of  a hardware  multiprocessor 
environment. 

There were four primary objectives of the simulation experiments: a) to
measure the software overheads involved in the design and execution of a complicated,
data-directed multiprocess(or) control structure, b) to determine whether there really
exists a significant amount of parallel activity in the speech-understanding task, c) to
understand how the various forms of interprocess communication and interference,
especially that from data access synchronization in the blackboard, affect the amount of
effective parallelism realized, and d) to gain insight into the design of an appropriate
scheduling algorithm for a multiprocess problem-solving structure. Certainly, any
results  presented  will  reflect  the  detailed  efficiencies  and  inefficiencies  of  the  particular 
system  implementation  being  measured,  but  hopefully  the  organization  of  HSII  is 
sufficiently  general  that  the  various  statements  will  have  a wider  quantitative 
applicability  for  those  considering  similar  multiprocess  control  structures. 

By  way  of  summary,  the  primary  characteristics  of  the  HSII  organization 

advance  the  precise  blackboard  demands  of  each  knowledge-source  instantiation. 
Nonetheless,  the  information  relating  to  the  locality  of  knowledge-source  data 
references is useful in scheduling processes so as to avoid excessive data access
interference (thereby improving the effective parallelism of the system).

include: a) multiple, diverse, independent and asynchronously executing knowledge
sources, b) cooperating (in terms of control) via a generalized form of the
hypothesize-and-test paradigm involving the data-directed invocation of knowledge-source
processes, and c) communicating (in terms of data) via a shared blackboard-like data
base in which the current data state is held in a homogeneous, multidimensional,
directed-graph data structure.


The  HSII  Speech  Understanding  System:  The  Simulation  Configuration 

The configuration of the HSII speech-understanding system, upon which the
following simulation results were based, consists of eight separate generic knowledge
sources (each of which may be realized by several active instantiations at any given
moment during the problem solution), each of which represents some body of knowledge
relevant to the speech-understanding task. Due to the excessive cost of the simulation
effort (and due to the limited stages of development of some available knowledge
sources), only a subset of the available knowledge sources was actually used in the
simulation experiments. Appendix A (which was extracted from (Lesser et al., 1974))
contains a more detailed description of the blackboard and the various knowledge
sources for the more complete HSII speech-understanding system. The knowledge
sources used in the simulation were: the Segment Classifier, the Phone Synthesizer
(consisting of two knowledge sources), the Phoneme Hypothesizer, the Phone-Phoneme
Synchronizer (consisting of three knowledge sources), and the Rating Policy Module.
These knowledge sources are activated by half a dozen precondition processes (which
are permanently instantiated in the system), which are continuously monitoring the
blackboard data base for events and data patterns relevant to their associated
knowledge sources. Both knowledge sources and preconditions may freely access the
centralized blackboard data base, which consists of nine lexicon levels.* The particular
levels used were chosen so as to facilitate the information exchange between the
various component knowledge sources.
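
The data-directed activation described above can be illustrated with a minimal sketch. It is not the HSII implementation: the class `Blackboard`, the methods `monitor`, `create_node`, and `write_field`, and the precondition `phoneme_precondition` are hypothetical names chosen for illustration, and preconditions are modeled as plain callbacks rather than as separate, permanently instantiated processes.

```python
from collections import defaultdict


class Blackboard:
    """Toy shared data base: nodes are dicts of named fields (an assumption)."""

    def __init__(self):
        self.nodes = {}                     # node id -> {field name: value}
        self.monitors = defaultdict(list)   # field name -> precondition callbacks
        self.next_id = 0

    def monitor(self, field, precondition):
        # Register a precondition to be told about writes to a given field.
        self.monitors[field].append(precondition)

    def create_node(self, **fields):
        # Node creation is a composite of several field writes.
        node_id = self.next_id
        self.next_id += 1
        self.nodes[node_id] = {}
        for name, value in fields.items():
            self.write_field(node_id, name, value)
        return node_id

    def write_field(self, node_id, field, value):
        # Every write distributes a data event to the interested preconditions.
        self.nodes[node_id][field] = value
        for precondition in self.monitors[field]:
            precondition(self, node_id)


invoked = []  # record of knowledge-source invocations requested by preconditions


def phoneme_precondition(bb, node_id):
    # Hypothetical data pattern: request a knowledge source for each phone node.
    if bb.nodes[node_id].get("level") == "phone":
        invoked.append(("phoneme_hypothesizer", node_id))


bb = Blackboard()
bb.monitor("level", phoneme_precondition)
seg = bb.create_node(level="segment", label="AH")
ph = bb.create_node(level="phone", label="AH")
```

Only the second node matches the monitored pattern, so only one knowledge-source invocation is recorded in `invoked`.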

* While there are eight conceptual information levels within the HSII speech-
understanding system (see Appendix A), the blackboard is abstractly segmented
according to lexicons, rather than information levels, since lexicons allow a finer
abstract decomposition of the blackboard.

This set of knowledge sources and preconditions and the associated operating
system facilities provided by the HSII organization were first implemented to execute on
a uniprocessor DECsystem-10 computer. The particular implementation represented
here was programmed in the Algol-like language, SAIL (Swinehart and Sproull, 1971),
using SAIL's multiprocessing facilities (Feldman et al., 1972) and making extensive use
of its LEAP associative data storage facility (Feldman and Rovner, 1969). Thus, while
the hardware environment of this version of the HSII speech-understanding system is
that of a single processor, the software environment is the multiprocessing structure
described throughout this paper. The simulation experiments were then run using this
HSII configuration, simulating the hardware environment of a closely-coupled
multiprocessor where processors can directly communicate with each other through
shared memory. The size of the HSII configuration used in the simulations was about
180K 36-bit words; 70K of this total was the HSII operating system plus the SAIL
runtime routines, 73K was precondition and knowledge source code plus variables, and
the remainder (which varied from 20K to 45K depending on the number of processors
being simulated and the number of processes being instantiated) represented the
blackboard data base plus process activation records and other SAIL working space.
The simulations were carried out to determine the efficiencies of the various HSII
multiprocessing mechanisms discussed previously, as well as to gain some insight into
any problems that might arise in the ensuing implementation of a HSII speech-
understanding system for the Carnegie-Mellon C.mmp multiprocessor. The following
sections will discuss the results of the various experiments which have been performed
using the multiprocessor-simulation version of the HSII speech-understanding system.


Simulation  Mechanisms  and  Simulation  Experiments 

The various multiprocessor simulation results were obtained by modifying the
flow of control through the usual HSII multiprocessing organization to allow simulation
scheduling points every time a running process could interact in any way with some
other concurrently executing process. Such points included blackboard data base
accesses and data base access synchronization points (including attempts to acquire
data base resources, both at the system and user levels, and any resulting points of
process suspension due to the unavailability of the requested resource, as well as the
subsequent points of process wake-up for retrying the access request). Simulation
scheduling points were also inserted whenever a data modification warning message
(triggered by modifying a tagged data field) was to be sent, as well as whenever a
process attempted to receive such a message. The scheduling mechanism itself was also
modified to allow for the simulated scheduling of multiple processing units, while
maintaining the state information associated with each processor being simulated (such
as the processor clock time of that simulated processor and the state of the particular
process being run on that processor). The simulation runs were performed so as to
keep the processor clock-times of each processor being simulated in step with one
another (the simulation being event-driven, rather than sampled), thereby allowing for
the accurate measurement and comparison of concurrent events across processors. By
selecting the number of processors to be simulated and choosing the usual scheduling
parameters and precondition and knowledge-source parameters, a chronological trace of
the activity of each process and processor could be obtained. By accumulating statistics
during the trace period and by performing various post-processing operations upon this
activity trace record, the simulation results presented in the following sections were
obtained.

The implementation of the C.mmp version of the HSII [...] in fact essentially a direct
mapping of the DECsystem-10 implementation, with additional design being done as
necessary to [...] the small address space problem ([...] at any moment only
a 32K-word window into the centrally located main memory).
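
As an illustration of how an event-driven simulation can keep several simulated processor clocks in step, the following minimal sketch always advances the processor whose clock is furthest behind. It is an assumption-laden toy, not the HSII simulator: the function `simulate` and its `segments` argument (run times between successive scheduling points, taken in order from a single ready queue) are hypothetical, and process suspension and wake-up are not modeled.

```python
import heapq


def simulate(num_processors, segments):
    """Assign run segments to simulated processors in clock order.

    segments: durations between scheduling points (hypothetical input).
    Returns the final multiprocessor clock time and a chronological trace
    of (start time, processor number, duration) records.
    """
    # Each heap entry is (current clock of the processor, processor number).
    processors = [(0.0, p) for p in range(num_processors)]
    heapq.heapify(processors)
    trace = []
    for duration in segments:
        # The processor with the smallest clock reaches its next
        # scheduling point first, so it is the next one to be advanced.
        clock, proc = heapq.heappop(processors)
        trace.append((clock, proc, duration))
        heapq.heappush(processors, (clock + duration, proc))
    final_clock = max(clock for clock, _ in processors)
    return final_clock, trace
```

Because the least-advanced processor is always chosen next, the start times in the trace never decrease, which is what makes events on different simulated processors directly comparable.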

Most  of  the  results  presented  here  were  achieved  by  using  a single  set  of 
knowledge  sources  (as  described  above),  with  a single  speech-data  input  utterance, 
keeping  the  data  base  locking  structure  and  scheduling  algorithms  essentially  fixed, 
while  varying  the  number  of  simulated  (identical)  processors.  Several  runs  were  also 
performed  to  test  the  effects  of  altering  the  knowledge-source  set,  altering  the  locking 
structure, and altering the mode of data input (the normal input mode being an utterance-
time-ordered introduction of input data which simulates real-time speech input).


Measures  of  Multiprocessing  Overhead:  Primitive  Operation  Timings 

Time measurements of various primitive operations were made using a 10-
microsecond hardware interval timer. Some of the timed primitive operations (such as
those involving simple data base access and modification) were not especially subject to
the fact that the problem-solving organization involved multiple parallel processes,
whereas others (such as those involving process instantiation and process
synchronization) were directly related to the multiprocess aspects of the organization
(and might even be taken in part as overhead when compared to alternative single-
process system organizations). The times for the various system operations, as shown
in Table 1, should be read as relative values, comparing the multiprocess-oriented
operations with the data accessing operations to get a relative feel for the overheads
involved in supporting and maintaining the multiprocess organization of HSII. Keep in
mind that such time measurements are highly dependent on the particular
implementation and can change fairly radically when implemented differently. In fact, a
primary use of such timings is in determining operating system bottlenecks so that such
code sections can be rewritten in a more optimal way. As a result, some primitive
operations reflect execution times which are a result of extensive optimization attempts,
while other operations (in particular, the "super lock" operations, lock! and unlock!)
have not yet been subjected to this optimization.

Table 1 gives timing statistics relating to the costs involved in maintaining the
shared, centralized blackboard data base. Two sets of statistics are given, one set
showing the operation times without the influence of data access synchronization
(blackboard locking) and one set with the locking structures in effect. These two sets
of times give a quantitative feeling for the cost of data access synchronization
mechanisms in this particular implementation of HSII. The figures given include the
average runtime cost per operation, the number of calls (in this particular timing run) to
each operation (thereby showing the relative frequencies of operation usage), and the
percentage of the overall runtime consumed by each operation. With respect to the
individual entries, create.node is a composite operation (involving many field-writes and
various local context updates) for creating blackboard nodes. The read.node.field and
write.node.field operations are used in accessing the individual fields of a node. Note
that included in any given field-read or -write operation is the cost of perhaps tagging
(or untagging) that particular field (or its node). The various functions of the
blackboard monitoring mechanism are contained within the field-write operations. Thus,
also included in the field-write operation is the cost of distributing the data event
resulting from the write operation to all relevant precondition and knowledge-source
process local contexts, as well as the cost of sending tag messages to all processes
which may have tagged the field being modified; these additional costs are also
accounted for independently in the send.msgs.and.events and notify.sset table entries.
Field-write operations are also responsible for evaluating any pre-preconditions
associated with the field being modified and activating any precondition whose
pre-precondition is satisfied. Included in the cost of reading a data field (e.g.,
read.node.field) is the cost of verifying the access right of the calling process to the
node being read (which could involve a temporary-locking operation,¹ the cost of which
is also given independently in the lock.node table entry); this access-right checking cost
is also separately accounted for by the read.access.chk operation. It should be noted
that because most of the mechanisms required to implement a data-directed control
structure are embedded in the blackboard write operations, a write operation takes
significantly longer to execute than a read operation. However, the actual
cost in terms of total run time of implementing a data-directed control structure is
comparatively small in the HSII speech-understanding system, because the frequency of
read operations is much higher than that of write operations. If this relative frequency
for read and write operations holds for other task domains (e.g., vision, robotics), then a
data-directed control structure (which is a very general and modular type of sequencing
paradigm) seems to be a very reasonable framework within which to implement such
tasks.

                         % total runtime    mean time (ms)    number of calls
                         w/o lock  w/ lock  w/o lock  w/ lock  w/o lock  w/ lock

Blackboard Accessing:
  create.node               6.96     4.15    35.81    50.77      287      287
  read.node.field           5.06    15.68     0.31     2.03    23577    25279
  write.node.field         14.13     7.75    13.96    18.44     1493     1476

Blackboard Associative Retrieval:
  retrieve                  2.72     4.98    25.07   109.45      160      160
  get.time.adjacent         9.31    15.33    23.44    92.00      586      586
  get.struct.adjacent       3.99     6.31    43.35   163.20      136      136
  get.nodes.in.rgn          2.05     0.87     2.98     3.00     1015     1015

Process Handling:
  invoke.ks                 5.29     2.30    22.64    23.64      345      342
  create.ks.prcs            0.75     0.31     3.21     3.22      345      342
  ks.cleanup                8.20     5.24    35.06    53.94      345      342
  invoke.pre                0.10     1.04    10.44    10.59       14       14
  create.pre.prcs           0.42     0.40     8.53    19.57       72       72

Local Context Maintenance:
  transfer.tags             7.12     2.99     9.12     9.17     1152     1149
  delete.all.tags           0.52     0.22     2.01     2.03      383      380
  notify.sset               6.52     3.01     2.63     2.92     3665     3626
  send.msgs.and.events      4.04     2.12     3.68     4.68     1021     1594
  receive.msg               0.36     0.15     1.00     1.01      531      530
  read.cset.or.sset         0.11     0.05     0.84     0.34      192      192

Data Access Synchronization:
  lock! (overhead)             -     7.78        -    57.47        -      476
  unlock! (overhead)           -     3.22        -    23.78        -      476
  lock.node                    -     2.32        -     2.94        -     2770
  exam.node                    -     9.34        -     2.40        -    13675
  lock.rgn                     -     0.11        -     1.77        -      227
  write.access.chk             -     0.41        -     0.98        -     1470
  read.access.chk              -    14.45        -     1.60        -    31761

Table 1. Primitive Operation Times
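
The trade-off between per-call cost and call frequency can be checked with a small calculation. The sketch below uses only the mean times and call counts listed for read.node.field and write.node.field with locking in effect; the variable names are illustrative, and the figures are those of the timing run as reproduced in Table 1.

```python
# (mean time in ms, number of calls), with locking in effect, from Table 1.
ops = {
    "read.node.field": (2.03, 25279),
    "write.node.field": (18.44, 1476),
}

# Accumulated time per operation in seconds: mean time per call x number of calls.
totals = {name: mean_ms * calls / 1000.0 for name, (mean_ms, calls) in ops.items()}

# How much more expensive one write is than one read, and how much more
# frequent reads are than writes.
cost_ratio = ops["write.node.field"][0] / ops["read.node.field"][0]
frequency_ratio = ops["read.node.field"][1] / ops["write.node.field"][1]

for name, seconds in totals.items():
    print(f"{name}: {seconds:.1f} s accumulated")
print(f"write/read cost per call: {cost_ratio:.1f}")
print(f"read/write call frequency: {frequency_ratio:.1f}")
```

On these figures a single write costs roughly nine times as much as a single read, but reads occur roughly seventeen times as often, so the accumulated time spent in writes stays below that spent in reads.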


Additional  blackboard  operation  costs  are  described  in  the  Associative 
Retrieval  section  of  Table  1.  Associative  retrieval  is  based  on  specifying  partial  node 
descriptions  (called  matching  prototypes)  which  serve  as  a means  of  retrieving  the  set 
of  blackboard  nodes  fitting  that  partial  description.  Retrieve  represents  the  various 
retrieval  operations  possible  using  these  matching  prototypes.  Retrieval  from  the 
blackboard  may  also  be  done  by  requesting  the  nodes  which  are  time-adjacent 
(according  to  the  utterance-time  dimension  of  the  speech-understanding  blackboard)  or 
structurally  adjacent  (according  to  the  blackboard  graph  structure)  to  a given  node  (or 
set of nodes); get.time.adjacent and get.struct.adjacent perform these operations.
Furthermore,  retrieval  may  be  done  by  requesting  the  set  of  nodes  contained  within  a 
certain  region  of  the  blackboard  (by  get.nodes.in.rgn). 
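
The retrieval styles just described can be sketched as follows. This is an illustration only: the node representation (dictionaries with `level`, `label`, `begin`, and `end` fields), the function names, and the definition of time adjacency (one node ending exactly where the other begins) are assumptions made for the example and are not taken from the HSII implementation.

```python
def retrieve(nodes, prototype):
    """Return the nodes whose fields agree with every field of the prototype."""
    return [n for n in nodes
            if all(n.get(field) == value for field, value in prototype.items())]


def get_time_adjacent(nodes, node):
    """Return nodes that abut the given node in utterance time (assumed rule)."""
    return [n for n in nodes
            if n is not node
            and (n["begin"] == node["end"] or n["end"] == node["begin"])]


def get_nodes_in_rgn(nodes, level, begin, end):
    """Return nodes at one level lying entirely within a time region."""
    return [n for n in nodes
            if n["level"] == level and n["begin"] >= begin and n["end"] <= end]


nodes = [
    {"id": 0, "level": "phone", "label": "AH", "begin": 0, "end": 10},
    {"id": 1, "level": "phone", "label": "T", "begin": 10, "end": 15},
    {"id": 2, "level": "phoneme", "label": "AH", "begin": 0, "end": 10},
]
phones = retrieve(nodes, {"level": "phone"})   # partial description: level only
neighbours = get_time_adjacent(nodes, nodes[0])
```

A matching prototype here is simply a partial description: any field it omits is left unconstrained.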

Table 1 also relates the costs of process handling within HSII. Process
invocation and process creation are separated (the former being a request from a
precondition or knowledge-source process to the scheduler to perform the latter), and
the costs are accounted separately, as in invoke.ks and create.ks.prcs. Ks.cleanup is the
cost of terminating a knowledge-source process; preconditions never get terminated.
The cost of initializing and terminating a knowledge-source process (i.e., invoke.ks and
ks.cleanup) is due to the overheads involved in maintaining local contexts, locking
structures, and data base monitoring (tagging), all of which are necessitated by the
multiprocess nature of the HSII organization. However, in a relative sense, this is not
expensive, since the total overhead associated with process handling amounts to only
about 9% of the overall execution time.

¹ If a process has not previously locked the node to which it desires access and the
process does not have any other node locked, then the system will temporarily lock
the node for the duration of the single read or write operation, without the process
having explicitly to request access to the node.

Additionally, local context maintenance costs are given in Table 1, since they
are also a cost of having asynchronous parallel processes. While individual tag creation
and deletion is handled by the primitive field-read and -write operations, tags may be
transferred from a precondition to the knowledge source it has invoked via transfer.tags
and destroyed at termination of a process via delete.all.tags. As noted above, notify.sset
and send.msgs.and.events are sub-operations of the field-write operations and represent
the cost of distributing data event notifications to all relevant local contexts.
Receive.msg is the operation used by precondition or knowledge-source processes to
receive a tagging message (or perhaps wait for one, if one does not yet exist); and
read.cset.or.sset is the operation for retrieving the information from a local context.

Finally,  Table  1 gives  the  costs  associated  with  the  data  access 
synchronization  mechanism.  Lock!  and  unlock!  represent  the  overhead  costs  of  locking 
and  unlocking  a group  of  nodes  specified  by  the  process  requesting  access  rights. 
These  two  operations  are  among  the  most  complex  routines  in  the  HSII  operating 
system,  the  complexity  arising  from  having  to  coordinate  the  allocation  of  data  base 
resources by two independent access allocation schemes (node-locking and region-
locking). This coordination is necessary in order to avoid any possibility of data base
deadlock by maintaining a homogeneous linear ordering among all data resources (nodes
and regions). The costs of lock! and unlock! do not include the time spent in performing
the actual primitive locking operations. The primitive lock costs are given by lock.node
(lock a node for exclusive access), exam.node (lock a node for read-only access), and
lock.rgn (lock a region for exclusive access). The access-checking operations
(write.access.chk  and  read.access.chk)  are  used  by  the  blackboard  accessing  routines 
discussed  above. 
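
The deadlock-avoidance idea mentioned above, a single linear ordering over all lockable resources, can be sketched generically. The code below is not the HSII lock! routine: the `Resource` class, the ordering key (resource kind, then identifier), and the functions `lock_all` and `unlock_all` are assumptions made for illustration, and read-only (exam) locks are not modeled.

```python
import threading


class Resource:
    """A lockable node or region with a position in one global ordering."""

    def __init__(self, kind, ident):
        self.key = (kind, ident)      # assumed ordering: by kind, then identifier
        self.lock = threading.Lock()


def lock_all(resources):
    """Acquire every requested resource in the global order."""
    ordered = sorted(resources, key=lambda r: r.key)
    for resource in ordered:
        resource.lock.acquire()
    return ordered


def unlock_all(ordered):
    """Release the resources in the reverse of the acquisition order."""
    for resource in reversed(ordered):
        resource.lock.release()


node = Resource("node", 7)
region = Resource("region", 2)
held = lock_all([region, node])   # request order does not matter
unlock_all(held)
```

Because every process acquires nodes and regions in the same order regardless of the order in which it names them, no cycle of processes each waiting on the next can form.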

These timing statistics can be used to determine the amount of system
overhead incurred in running precondition and knowledge-source processes under the
HSII operating system. The following summary statistics are offered, given as
percentages of the total execution time, the percentages being calculated so as to avoid
overlapping between categories (as, for example, factoring blackboard reading costs out
of blackboard access synchronization):

    Blackboard reading                    16%
    Blackboard writing                     4%
    Associative retrieval                  7%
    Internal computations of processes    27%
    Local context maintenance             10%
    Blackboard access synchronization     27%
    Process handling                       9%

Another way of viewing these figures is that approximately half of the execution time
involves multiprocessor overheads (i.e., local context maintenance, blackboard access
synchronization, and process handling). Based on the assumption that this multiprocess
overhead is independent of the parallelism factor achieved,* then a parallelism factor of
2 or greater is required in order to recover the multiprocess overhead.
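
The break-even argument can be made explicit with a short calculation. The sketch assumes, as a simplification not stated in the text, that a hypothetical single-process organization would avoid the three multiprocess overhead categories entirely and would otherwise do the same work; the variable names are illustrative.

```python
# Percentage of total execution time per category, from the summary above.
shares = {
    "blackboard reading": 16,
    "blackboard writing": 4,
    "associative retrieval": 7,
    "internal computations of processes": 27,
    "local context maintenance": 10,
    "blackboard access synchronization": 27,
    "process handling": 9,
}

# Fraction of the run time attributed to multiprocess overhead.
overhead = (shares["local context maintenance"]
            + shares["blackboard access synchronization"]
            + shares["process handling"]) / 100.0

# If the multiprocess system takes time T on one processor, the assumed
# overhead-free system takes (1 - overhead) * T. A speed-up S brings the
# multiprocess time down to T / S, so break-even requires S = 1 / (1 - overhead).
break_even = 1.0 / (1.0 - overhead)

print(f"overhead fraction: {overhead:.2f}")
print(f"break-even parallelism factor: {break_even:.2f}")
```

With the overhead at 46% the break-even factor is a little under 2, and it is exactly 2 if the overhead is rounded to one half, in line with the statement above.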


Effective  Parallelism  and  Processor  Utilization 

The problem-solving organization underlying HSII was designed to take
maximum advantage of any separability of the processing or data components available
within that organization. Knowledge sources were intended to be largely independent
and capable of asynchronous execution in the form of knowledge-source processes.
Overall system control was to be distributed and primarily data-directed, being based on
events occurring in a globally shared blackboard data base. The intercommunication
(and interdependence) of the various knowledge-source processes was to be minimized
by making the blackboard data base the primary means of communication, thereby
exhibiting an indirection with respect to communication similar to the indirect data-
directed form of process control. Such a problem-solving organization was believed to
be particularly amenable to implementation in the hardware environment of a network
of closely-coupled asynchronous processors which share a common memory. Given
sufficiently many completely non-interfering processes (i.e., processes which do not
interfere in any way with the execution progress of one another), one would expect the
achieved parallelism (speed-up) of that set of processes executing on n identical
processors to be a factor of n, as compared to the same set of processes executing on
a single processor (assuming the same scheduling and multiprocessing overheads).
While the HSII organization attempted to allow the various knowledge sources to be as
independent as possible, the various processes were to cooperate with one another
(primarily via the blackboard data base) in the effort to effect the problem solution.¹
This necessary cooperation (and the various forms of execution interference resulting
from it) was expected to result in the achieved parallelism in a multiprocessor
environment being somewhat less than the potential parallelism without interference.

* This assumption, based on timing statistics from a series of runs with different
numbers of processors, seems valid except for the cost of context swapping and
process suspension, which depends upon the amount of data base interference and
the number of processors.

Several experiments were run to measure the parallelism achieved in this
particular implementation of the HSII problem-solving organization using varying
numbers of identical processors. Each of these experiments was run with the
knowledge-source set described previously, using the same input data (introduced into
the data base so as to simulate real-time speech input), the same blackboard locking
structure, and the same scheduling algorithm, while varying the number of (identical)
processors. An example of the graphical output produced by the simulation, for the
case of eight processors, is displayed in Figure 3. To comment on these activity plots,
the "# runnable processes" plot gives the number of processes either running or ready
to run at each simulation scheduling point; the "# running processes" plot gives the
number of actively executing processes at each scheduling point; the "# ready
processes" plot shows the number of processes awaiting assignment to a processor at
each scheduling point; and the "# suspended processes" plot gives the number of
processes blocked from executing because of data access interference or because they
are waiting on the receipt of a tagging message.

¹ Note that the size of the HSII blackboard is expected to grow to only several
thousand nodes (hypotheses and links), at, say, 25 field entries apiece, depending, of
course, on the task domain. Thus, it is assumed (for the purposes of the current
investigations, at least) that the blackboard is entirely resident in primary memory;
thus, input/output operations are not an issue here, the system being essentially
compute-bound.

Referring to Figure 3c, notice the spiked nature of the ready-processes plot.
This is a result of delaying the execution of a precondition (due to the limited
processing power available) beyond the point in time at which its pre-precondition is
first satisfied: the longer a precondition is delayed, the more data events it is likely to
accumulate in the meantime, and the more knowledge-source processes it is likely to
instantiate once it does get executed; hence the spiked nature of the resultant ready-
processes plots for configurations of few processors. As parallel processing power
increases, preconditions can more often be run as soon as their pre-preconditions are
initially satisfied, and the spiking phenomenon subsides.

As an example of how these activity plots have been used in upgrading the
performance of the implementation, compare Figure 4 to Figure 3c. Figure 4 depicts the
process activity under the control of a scheduler which did not attempt to perform load
balancing with respect to ready preconditions; and as a result of not increasing the
relative scheduling priority of preconditions as they received more and more data
events, the activity spike phenomenon referred to above became predominant, to the
extent of reducing process activity to a synchronous system while the long-time waiting
precondition instantiates a great many knowledge-source processes all at once.* Figure
3c shows the activity on the same number of processors, but using a somewhat more
intelligent scheduling algorithm, with a resulting reduction in the observed spiking
phenomena. This improved scheduling strategy is the one used for all plots presented
herein.
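
The scheduling improvement described above, raising the priority of a waiting precondition as it accumulates data events, can be illustrated with a hypothetical priority rule. The text does not give the actual HSII formula; the linear rule, the `weight` parameter, and the function names below are assumptions made purely for illustration.

```python
def priority(base, pending_events, weight=1.0):
    """Hypothetical rule: priority grows linearly with accumulated data events."""
    return base + weight * pending_events


def pick_next(ready):
    """Choose the ready process with the highest current priority.

    ready: list of (name, base priority, pending data events) tuples.
    """
    return max(ready, key=lambda p: priority(p[1], p[2]))[0]


# A knowledge-source process competes with a precondition that has lower
# base priority but keeps receiving data events while it waits.
early = pick_next([("ks_a", 5, 0), ("pre_x", 3, 1)])
later = pick_next([("ks_a", 5, 0), ("pre_x", 3, 4)])
```

Under such a rule a precondition cannot wait indefinitely while events pile up, which limits the size of the burst of knowledge-source processes it instantiates when it finally runs.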

* The reduction to synchronous activity can be inferred from Figure 4 by noting that
the sample points (vertical tick marks) are taken at each simulation scheduling point,
and the lack of samples between times 220 and 380 indicates that the process that
started running at 220 had no concurrently running processes competing with it until
time 380, when there were suddenly 25 new processes contending for computing
resources.

In addition to the plots described above, various other measures were made
to allow an explicit determination of processor utilization and effective parallelism for
varying numbers of processors. Referring to Table 2, one can get a feeling for the
activity generated by employing increasing numbers of processors. All simulations
represented in Table 2 were run for equivalent amounts of processing effort with
respect to the results created in the blackboard data base by the knowledge source
activity. The final clock time of the multiprocessor configuration being simulated is
given in simulated real-time seconds, and the accumulated processor idle and lost times
are also given. Idle time is attributed to a processor when it has no process assigned
to it and there are no ready processes to be run; lost time is attributed when the
process on a processor is suspended for any reason and there are no ready processes
which could be swapped in to replace the suspended process. Processor utilization
(calculated using the final clock time and processor idle and lost times) is given in Table
2; Figure 5 shows the corresponding effective parallelism (speed-up), based on the
processor utilization factors of Table 2.

number of prcrs               1       2       4       8      16       32
(all times in secs)                                               (special*)

KS instantiations           355     401     423     421     415      434
PRE activations              82     126     173     213     200      229
multiprcr clock time       1076     634     389     350     351       43
total idle time               9      15      37     380    2608      867
total lost time               0       5      34     900    1546        0
avg cxt swaps                 0     309     942     368       9        0
avg prcr utilization        99%     98%     95%     54%     26%      37%
effective # prcrs          0.99    1.96    3.80    4.32    4.16    11.84
utilization speed-up       1.00    1.98    3.84    4.36    4.20    11.96

* The 32-processor column represents an experiment which
was run under special conditions, to be explained below,
and should not be compared directly to the other columns
of the table.

Table 2. Processor Utilization
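
The utilization figures can be recomputed from the clock, idle, and lost times. The text does not state the formula, so the sketch below uses an inferred one: utilization is the fraction of the total processor time (number of processors times final clock time) that was neither idle nor lost. This inferred formula reproduces the "effective # prcrs" and "utilization speed-up" rows of Table 2 to within rounding; the function and variable names are illustrative.

```python
# number of processors -> (final clock time, total idle time, total lost time),
# all in simulated seconds, from Table 2 (the special 32-processor run is omitted).
runs = {
    1: (1076, 9, 0),
    2: (634, 15, 5),
    4: (389, 37, 34),
    8: (350, 380, 900),
    16: (351, 2608, 1546),
}


def utilization(n, clock, idle, lost):
    """Fraction of available processor time spent doing work (inferred formula)."""
    return 1.0 - (idle + lost) / (n * clock)


def effective_processors(n, clock, idle, lost):
    """Number of fully utilized processors the configuration is equivalent to."""
    return n * utilization(n, clock, idle, lost)


effective = {n: effective_processors(n, *times) for n, times in runs.items()}
# Speed-up is taken relative to the effective processors of the uniprocessor run.
speed_up = {n: e / effective[1] for n, e in effective.items()}

for n in sorted(runs):
    print(f"{n:2d} processors: utilization {utilization(n, *runs[n]):.0%}, "
          f"effective {effective[n]:.2f}, speed-up {speed_up[n]:.2f}")
```

The calculation makes the drop between eight and sixteen processors visible directly: the final clock times are nearly equal, but the sixteen-processor run accumulates far more idle and lost time.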

The speed-up for this particular selection of knowledge sources is
appreciable up to four processors, but drops off substantially as one approaches
sixteen processors. In fact, a rather distressing feature of this effective parallelism plot
is that the speed-up actually decreases slightly in going from eight processors to a
sixteen-processor configuration (from a speed-up of 4.36 over the uniprocessor case,
down to 4.20). This may be explained by noting that both the eight- and sixteen-
processor runs had approximately equal final clock times; but in the sixteen-processor
case, the number of runnable processes never exceeded sixteen processes, so any
ready process could always be accommodated immediately. As a result, the number of
knowledge-source instantiations and precondition activations fell off a bit from the
eight-processor case, because the preconditions were more likely to be fully satisfied
the first time they were activated (since all ready processes, knowledge-source
processes in particular, could be executed immediately and complete their intended
actions sooner, so that when a precondition came to be activated, it would more likely
find its full data pattern to be satisfied); thus, preconditions would not often be aborted,
having to be re-tested upon receiving a subsequent data event. However, running
fewer preconditions resulted in much more idle time for the sixteen-processor
configuration (the increase in lost time indicated in Table 2 is an artifact of having too
many processors available, since suspended processes would tend to remain on
otherwise idle processors rather than being swapped off the processor; note the
rather dramatic decrease in context swaps indicated by Table 2 for the sixteen-
processor case). The result is a lower proportionate utilization of the processor
configuration, and hence a decrease in the effective parallelism from the eight-
processor configuration to the sixteen-processor configuration.

Due to the limited state of development of the total set of knowledge sources,
the set of knowledge sources used in the simulation was necessarily limited; so the fact
that these plots indicate that not more than about four to eight processors are being
effectively utilized is not to say that the full HSII speech-understanding system needs
only eight processors. One might ask: if only 4.16 processors of the sixteen-
processor configuration are being totally utilized (see Table 2), what is the maximum
potential effective parallelism, given this set of knowledge sources? To answer this
question, an experiment was performed in which effectively infinite processing power
was provided to this knowledge-source set and all data access interference was
eliminated (by removing the locking structure overheads and blocking actions); the
scheduling algorithm was kept unchanged, as was the input data, although the input data
stream was entered so as to be instantaneously available in its entirety (rather than
being introduced in a simulated real-time, "left-to-right" manner). The results of this
experiment are summarized by the 32-processor column of Table 2 (32 processors was
an effectively infinite computing resource in this case, since eight of the processors were
never used during the simulation). Notice that no lost time was attributed to the run,
due to the lack of locking interference; and the resultant processor utilization was 37%
of 32 processors, or 11.84 totally utilized processors. Thus, data base interference
caused by the particular data base accessing patterns and associated locking structures of
the knowledge-source set used in the experiment significantly affected processor
utilization; if the locking structures could be used in a more non-interfering
manner, the speed-up indicated by the eight- or sixteen-processor
configurations could be increased substantially. The next section will analyze in detail
the exact causes of this data base interference, and propose changes to the knowledge-
source locking structure so as to reduce potential interference.

Table 3 presents some other system configurations to show effective
processor utilizations under varying conditions. The first row repeats the statistics of
the sixteen-processor case of Table 2; the second row is a summary of the 32-
processor case of Table 2, as described above. Three further data points are offered
to indicate the effects of increasing the size of the knowledge-source set. The last
three rows of Table 3 involve experiments using an expanded knowledge-source set
consisting of the knowledge sources of all the previous runs plus the Syntactic Word
Hypothesizer (see Appendix A) and its precondition. Using this expanded knowledge-
source set, simulations were performed on a sixteen-processor configuration with the
locking structure in effect, presenting
the input data in the usual "left-to-right" manner, as well as in the instantaneous
manner used in the infinite-processor test. Comparing the results (in Table 3) to the
original sixteen-processor run, the "left-to-right" input scheme achieved a processor
utilization of 33%, up 7% from the smaller knowledge-source set case; and by presenting
all input data simultaneously, the utilization rose to 35%. The fifth row of Table 3
represents the results of providing effectively infinite computing power (only 25
processors were ever used during the run) to the expanded knowledge-source set and
eliminating all data access interference, in the same manner as for the experiment of the
second row. In this "optimal" situation for the expanded knowledge-source set,
processor utilization was measured at 46%, or 14.72 totally utilized processors. Again, it
may be noted that a more effective (less interfering) use of the locking structures can
result in substantial increases in processor utilization and effective parallelism.
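
The utilization figures quoted for Table 3 can be cross-checked with a little
arithmetic. The sketch below is an illustration only and is not part of the simulation
system. It assumes that the idle and lost times of Table 3 are totals summed over all
processors and that the multiprocessor clock is elapsed time, so that the available
processor time is the clock multiplied by the number of processors; the source does not
state this formula. The function names are hypothetical.

```python
def utilization(clock, idle, lost, processors):
    """Fraction of available processor time spent on useful work.

    Assumes idle and lost are summed over all processors and clock is
    elapsed time (an assumption; the source does not state the formula).
    """
    available = clock * processors
    return (available - idle - lost) / available


def effective_processors(percent_util, processors):
    """Number of totally utilized processors for a rounded utilization."""
    return percent_util * processors / 100.0


# Rows of Table 3: (clock, total idle, total lost, processors).
ROWS = [
    (351, 2608, 1546, 16),
    (43, 867, 0, 32),
    (148, 854, 726, 16),
    (155, 839, 784, 16),
    (13, 226, 0, 32),
]

for clock, idle, lost, n in ROWS:
    pct = round(100 * utilization(clock, idle, lost, n))
    print(n, pct, round(effective_processors(pct, n), 2))
```

Under these assumptions the rounded percentages come out as 26, 37, 33, 35, and 46,
matching the utilization column of Table 3, and the effective processor counts follow
from multiplying the rounded percentage by the number of processors.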

Adding the Syntactic Word Hypothesizer achieved the
increases in utilization noted in Table 3 because it operates on lexicons that are used
by only one other knowledge source (the Phoneme Hypothesizer) in the basic
knowledge-source set; hence, the process interference introduced by adding this
knowledge source was minimal. Unfortunately, the development of knowledge sources
at lexicon levels which more directly conflict with those of existing knowledge sources
has been limited, so direct experimentation on the interfering effects of such knowledge
sources could not be performed; but based on the observations comparing the 32-
processor without-lock experiments to the original sixteen-processor with-lock runs,
substantial interference due to ineffective use of the locking structure would be
expected when such "competing" knowledge sources are added. One mitigating
circumstance which could alleviate such interference was noted in the "instantaneous"
input case of the expanded knowledge-source set, as compared to the "left-to-
right" input case: if process activity can be spread across the utterance-time dimension
of the blackboard, process interference would decrease, but data
access synchronization interference can easily overwhelm this improvement. Further
experiments along these lines will be attempted as the appropriate knowledge sources
become available for use.


experiment                multiprcr   total   total   % util   effective
description               clock       idle    lost             # prcrs

8 KS’s, 6 PRE’s           351         2608    1546    26%      4.16
16 prcrs, w/ lock
l-to-r input

8 KS’s, 6 PRE’s           43          867     0       37%      11.84
32 prcrs, w/o lock
instantaneous input

9 KS’s, 7 PRE’s           148         854     726     33%      5.28
16 prcrs, w/ lock
l-to-r input

9 KS’s, 7 PRE’s           155         839     784     35%      5.60
16 prcrs, w/ lock
instantaneous input

9 KS’s, 7 PRE’s           13          226     0       46%      14.72
32 prcrs, w/o lock
instantaneous input


Table 3. System Configuration Variations


Execution  Interference  Measurements 

In  addition  to  the  primitive  operation  timings  and  achieved  parallelism 
measurements  given  above,  various  other  measurements  were  made  to  determine  the 
various  aspects  of  system  performance  as  related  to  multiprocessing.  As  has  already 
been  mentioned,  a major  concern  in  a multiprocess  environment  in  which  the  various 
processes  are  not  entirely  independent  is  that  of  execution  interference.  Execution 
interference  may  arise  whenever  any  given  process  enters  a critical  section  within 
which  it  requires  the  integrity  of  a given  data  structure  be  maintained  (thereby 
necessitating  a means  by  which  to  disallow  access  to  others  until  the  critical  section  is 
exited).  Execution  interference  may  also  arise  whenever  processes  must  synchronize 
their  activities  and  perhaps  cause  themselves  to  wait  on  an  event  based  on  an  action 
which  is  to  be  performed  by  some  external  process.  Thus  execution  interference  may 
arise  due  to  causes  external  to  the  process  being  delayed  (as  in  the  case  of  trying  to 
access  a data  structure  which  is  currently  held  for  exclusive  access  by  another 
process),  or  the  interference  may  arise  due  to  causes  internal  to  the  process  being 
delayed  (as  when  a process  delays  itself  by  waiting  for  the  occurrence  of  an  externally 
caused  event).  As  a result  of  the  HSII  design  philosophy,  which  states  that  the  various 
knowledge-source  processes  should  be  as  independent  as  possible  in  specification  and 
execution,  most  of  the  execution  interference  experienced  in  HSII  is  of  the  external 
variety,  wherein  a process  is  delayed  due  to  external  causes  unknown  to  itself  (and  the 
delay  itself  is  transparent  to  the  process  being  delayed). 

As previously described, there are two methods in the HSII system for
preserving data integrity: a) guaranteeing exclusive access through the use of node-
and region-locking primitives, and b) placing data assumptions in the blackboard,
through tagging primitives, which when violated cause a signal to be sent to the process
making the assumption. There is an interesting balance in terms of execution overhead
and execution interference between these two techniques. The region-locking
technique is least costly in terms of execution overhead and is the easiest to embed in a
program, but causes the most execution interference. This is in contrast to the use of
tagging, which is the most costly in terms of execution overhead and is the most difficult
to embed in a program, but causes the least execution interference. Both of these methods
were used for guaranteeing data integrity in the precondition and knowledge-source set
that was used in the simulation experiments.
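
The contrast between the two methods can be made concrete with a small,
single-threaded model. It is only a sketch under stated assumptions: the class
ToyBlackboard, its method names, and the field and process names are hypothetical and
do not correspond to the actual HSII primitives. A blocked writer is modeled by a
False return value rather than by real process suspension.

```python
from collections import defaultdict


class ToyBlackboard:
    """Hypothetical, single-threaded model of the two integrity methods."""

    def __init__(self):
        self.fields = {}                     # field name -> value
        self.region_owner = {}               # region -> owning process
        self.tags = defaultdict(set)         # field name -> tagging processes
        self.signals = defaultdict(list)     # process -> violated fields

    def lock_region(self, region, process):
        """Grant exclusive access, or report that the caller must wait."""
        owner = self.region_owner.setdefault(region, process)
        return owner == process

    def unlock_region(self, region, process):
        if self.region_owner.get(region) == process:
            del self.region_owner[region]

    def tag(self, field, process):
        """Record a data assumption without holding any access rights."""
        self.tags[field].add(process)

    def write(self, region, field, value, process):
        """Write a field; a blocked writer gets False instead of waiting."""
        owner = self.region_owner.get(region)
        if owner is not None and owner != process:
            return False                     # would be descheduled here
        self.fields[field] = value
        for tagger in self.tags[field] - {process}:
            self.signals[tagger].append(field)   # assumption violated
        return True


bb = ToyBlackboard()

# Region locking: the reader's lock blocks a concurrent writer.
bb.lock_region("PSEG", "reader")
blocked = not bb.write("PSEG", "H1.rating", 40, "writer")
bb.unlock_region("PSEG", "reader")

# Tagging: the writer proceeds, and the reader is signalled afterwards.
bb.tag("H1.rating", "reader")
proceeded = bb.write("PSEG", "H1.rating", 55, "writer")
print(blocked, proceeded, bb.signals["reader"])
```

In the locking case the cost falls on the writer, which is delayed; in the tagging case
the cost falls on the tagging process, which must be prepared to handle the signal.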

In structuring each knowledge source so as to preserve its data integrity, no
a priori assumptions were made about the non-modifiability of any blackboard data that
the knowledge source used in its processing (i.e., it was assumed that any blackboard
information that the knowledge source read could perhaps be modified by some other
concurrent knowledge source). This self-contained approach to the design of a
knowledge source’s locking and tagging structure is required if the modularity of the
system, with respect to deletion or addition of knowledge sources, is to be preserved.

The knowledge sources that were used in the simulation experiments were
not originally designed so that they could be interrupted at arbitrary points in their
processing, and consequently they lacked the appropriate locking and tagging structure
to guarantee data integrity in a multiprocess(or) environment. The addition, as an
afterthought, of the appropriate locking and tagging structure to these knowledge
sources was sometimes quite difficult. This was an especially serious problem when an
attempt was made to put tagging primitives into knowledge sources which had internal
backtracking control structures for searching the node graph structure in the
blackboard. This difficulty arises because previously made data assumptions (tags in the
blackboard) associated with a partial path (sequence of nodes in the blackboard) must
be removed upon discovering that the path cannot be successfully completed. Thus,
most of the knowledge sources in the experiment did not use tagging as a method of
guaranteeing integrity, but rather used a combination of node- and region-locking.
However, preconditions, which have a much simpler structure and generally do not write
in the blackboard, were modified to use the tagging mechanism. In addition, to further
simplify knowledge-source locking structures, region-locking was used wherever
possible. This excessive use of region-locking was mainly responsible for the significant
amount of interference among processes which caused the effective processor
utilization to go from an optimal 12 to a realized 4 (see Table 2).
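
The bookkeeping burden that tagging places on a backtracking search can be seen in
the following minimal sketch. It is not taken from any HSII knowledge source: the
function find_path, the graph layout, and the node and owner names are hypothetical.
The search tags every node on the current partial path and must retract each tag when
that path is abandoned.

```python
def find_path(graph, start, goal, tags, owner):
    """Depth-first search that tags each node on the current partial path.

    graph: dict mapping node -> list of successor nodes (hypothetical layout).
    tags:  dict mapping node -> set of owners, mutated in place.
    Returns the path as a list, or None if no path exists.
    """
    tags.setdefault(start, set()).add(owner)      # record the data assumption
    if start == goal:
        return [start]
    for successor in graph.get(start, []):
        if owner in tags.get(successor, set()):
            continue                              # already on the current path
        rest = find_path(graph, successor, goal, tags, owner)
        if rest is not None:
            return [start] + rest
    tags[start].discard(owner)                    # dead end: retract the tag
    return None


graph = {"H1": ["H2", "H3"], "H2": ["H4"], "H3": ["H5"], "H4": [], "H5": []}
tags = {}
path = find_path(graph, "H1", "H5", tags, "KS-1")
tagged = sorted(node for node, owners in tags.items() if "KS-1" in owners)
print(path, tagged)
```

After the search only the nodes on the successful path remain tagged; the tags placed
on the abandoned branch through H2 and H4 have been removed. Every exit from the search
must perform this retraction, which is the retrofit difficulty described above.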

Figure 6 shows an interesting case demonstrating that the indiscriminate use
of region-locking can obstruct the execution progress of many processes and thereby
temporarily reduce the effective parallelism of the system. It represents a snapshot of
the blackboard locking structure taken during the execution of the simulation. The grid
structure represents the two-dimensional abstract data structure, the dimensions being
lexicon level and region element number (corresponding to the utterance-time
dimension). At the point of each snapshot, the outstanding node and region locks are
indicated, as well as the areas requested (but not yet obtained) by suspended
processes. The various (non-interfering) tags placed throughout the data base are also
indicated. The key indicates the sets of active and suspended processes (the names
referring to the precondition and knowledge source names, and the numbers in the
names indicating a process instantiation index unique to that particular process). This
particular snapshot was taken from the sixteen-processor simulation run with the
smaller knowledge-source set. Notice that PSYN263 has locked regions at the PHON,
MXN, and PSEG lexicon levels for its exclusive access; the nodes locked by PSYN263
(hypotheses being indicated by H<sequence number>, and links by L<sequence number>)
within these regions are those being created by PSYN263, hence the reason for the
region locks. Unfortunately, this locking action resulted in the suspension of six other
processes awaiting access to parts of the PHON and PSEG lexicon levels which overlap
PSYN263’s region-locks. Each of these suspended processes is waiting to acquire
access-rights to a node in these locked regions; in fact, PREtPSYNlPSYN and CSEG259
are both waiting on the same node (H141). The diagram also shows the various
(non-interfering) tags which were placed on the various nodes at the PHON and PSEG lexicon
levels by three of the processes at some previous time. Figure 7, which is another
snapshot of the locking structure, shows a case where execution interference was not so
significant.

The reason the locking structure plots are localized in the lower left-hand
corner of the blackboard structure is that the construction of the data base in the
speech-processing task is initially left-to-right due to the time-sequential nature of the
speech input. Also, the particular set of knowledge sources chosen for use in the
simulation experiments happened to be an effectively bottom-up speech recognition
system (some of the top-down knowledge sources having not yet been developed to a
stable enough state to have been used in the simulations); hence, activity starts in the
lower left-hand corner of the blackboard. Further simulations are planned which will
work in a combined top-down and bottom-up fashion, thereby increasing the potential
parallelism (since the top-down knowledge sources will presumably not interfere with
the execution of the bottom-up knowledge sources as much as additional competing
bottom-up knowledge sources would). The expanded knowledge-source set experiments
presented above were a first step in introducing such top-down knowledge; as more
knowledge sources become available, their various interference effects will be
investigated. Also, other tasks which could use the HSII organization might not
necessarily have the left-to-right input characteristics of speech, so future simulations
will also test a more distributed input pattern, thereby also increasing the potential
parallelism by spreading the process activity across the breadth of the blackboard; the
several experiments presented above which introduced the input in an instantaneous
manner were the initial attempts in this direction.


A more analytic approach to analyzing the data access interference
experienced by precondition and knowledge source processes, for varying numbers of
processors, is given in Table 4.


number of prcrs                  1       2       4       8       16

(all times in secs)

avg BB accesses/KS               54.4    52.8    54.5    53.9    56.4
avg BB accesses/PRE              96.7    68.7    55.7    48.2    51.1

avg prim locks/KS                27.9    27.4    28.0    25.7    26.9
avg prim locks/PRE               96.7    68.7    55.7    48.2    51.1

avg dsched/prim lock(KS)         0       0.020   0.060   0.055   0.053
avg dsched/prim lock(PRE)        0       0.009   0.026   0.045   0.040

avg dsched duration/KS           0       5.08    5.69    1.75    1.90
avg dsched duration/PRE          0       3.95    1.91    1.35    1.86

avg ext swaps                    0       309     942     368     9
avg ext swaps/dsched             0       1.03    0.97    0.36    0.01


Table 4. Data Access Characteristics

Essentially, Table 4 is an extension of Table 2, which was discussed in the
previous section (i.e., the underlying simulation runs were the same for both tables).
Execution interference was measured by recording the amount of process suspension
(also called descheduling), which results from processes being temporarily blocked in
their attempts to gain access to some part of the blackboard data base.* As might be
expected, as process activity increases with increasing numbers of processors, the
possibility of execution interference increases (see table entries on
"deschedules/primitive lock"). This phenomenon stops at eight processors because in
these simulation experiments there were rarely more than eight processes executing at
any given moment. At the same time, with more and more processing power available,
the likelihood of suspended processes being unblocked and becoming available for
further processing increases as the number of processors increases (see table entries
on "deschedule duration"). This phenomenon is also indicated by the significant
decrease in processor context swaps per deschedule (i.e., with more processors, it
becomes less likely that when a process is suspended there will be another process
ready to execute).
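
As a rough illustration of how the Table 4 entries combine, the sketch below multiplies
the average number of primitive locks per knowledge source by the average number of
deschedules per primitive lock. The product, an expected number of deschedules per
knowledge-source execution, is a derived figure that is not reported in the table
itself, and it assumes the two averages were taken over the same set of executions.
The dictionary and function names are hypothetical.

```python
# Values copied from Table 4, indexed by number of processors.
PRIM_LOCKS_PER_KS = {1: 27.9, 2: 27.4, 4: 28.0, 8: 25.7, 16: 26.9}
DSCHED_PER_LOCK_KS = {1: 0.0, 2: 0.020, 4: 0.060, 8: 0.055, 16: 0.053}


def expected_deschedules(processors):
    """Average deschedules per knowledge-source execution (derived figure)."""
    return PRIM_LOCKS_PER_KS[processors] * DSCHED_PER_LOCK_KS[processors]


for n in sorted(PRIM_LOCKS_PER_KS):
    print(n, round(expected_deschedules(n), 2))
```

For four or more processors the product lies between one and two deschedules per
execution, so the rate of interference per lock operation is low even though each
knowledge source issues many lock operations.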

The major point that can be drawn from this table is that the decrease in
processor utilization caused by the locking structure is not due to a high rate of data
access interference (i.e., at most only 6% of the primitive locks result in deschedules)
but rather to the long duration over which descheduled processes are blocked. This
deschedule duration, in the optimal case of 16 processors, where processes do not have
to wait for an available processor, is approximately 2 seconds, which is very close

* The number of deschedules attributed to a process is related to the inner workings
of the locking mechanism. Not only is the granularity of the locking structure
important (i.e., how small a piece of the blackboard data base can be requested for
access allocation), but the granularity of the process blocking mechanism is important.
For example, processes could be blocked upon trying to gain access to a node and
then relegated to waiting in a set of processes which are waiting on any node at the
level of the requested node; or the wait set could be divided according to the
individual nodes being waited upon. If, in an attempt to conserve semaphore
structures, the former strategy is chosen, it could become quite expensive to
determine whether, upon receiving an unlock wake-up signal for the wait set, a
particular member of the wait set is really re-schedulable as a result of that wake-up
signal; hence, it may be cheaper to release all waiting processes in the set, even
though all but one will just become descheduled again. If the single-node wait set is
used, the costs of maintaining separate semaphores for every possible data object
may become prohibitively expensive, although process re-scheduling would not be
done unnecessarily in such a scheme.
to the average run time of a knowledge source. This long duration occurs because the
knowledge-source locking structures involve executing region locks at the beginning of
the knowledge source execution. These region-locks define the entire blackboard area
(and perhaps even more) that the knowledge source will either examine or modify
during its entire execution.^ These locks are then released only at the termination of
the knowledge source execution. Thus, if data access interference (i.e., a primitive lock
deschedule) occurred because of a previously executed region-lock, then the suspended
process would very likely not be unblocked until the knowledge source executing the
region-lock had completed its processing.
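
The wait-set trade-off described in the footnote on the blocking mechanism can be
sketched as follows. This is an illustrative model only: the class WaitSets, the
per_node switch, and the process and node names are hypothetical, and the count of
distinct wait-set keys merely stands in for the number of semaphore structures a real
implementation would need.

```python
from collections import defaultdict


class WaitSets:
    """Hypothetical model of blocked processes grouped into wait sets."""

    def __init__(self, per_node):
        self.per_node = per_node            # True: one wait set per node
        self.waiting = defaultdict(list)    # wait-set key -> [(process, node)]
        self.keys_used = set()              # stands in for semaphores allocated

    def _key(self, level, node):
        return (level, node) if self.per_node else level

    def block(self, process, level, node):
        key = self._key(level, node)
        self.keys_used.add(key)
        self.waiting[key].append((process, node))

    def unlock(self, level, node):
        """Wake the wait set for this node; return (useful, spurious) wake-ups."""
        released = self.waiting.pop(self._key(level, node), [])
        useful = [p for p, n in released if n == node]
        spurious = [(p, n) for p, n in released if n != node]
        for process, other in spurious:
            self.block(process, level, other)   # descheduled again
        return useful, [p for p, _ in spurious]


def demo(per_node):
    sets = WaitSets(per_node)
    for process, node in [("A", "H1"), ("B", "H2"), ("C", "H3")]:
        sets.block(process, "PHON", node)
    useful, spurious = sets.unlock("PHON", "H1")
    return useful, spurious, len(sets.keys_used)


print(demo(per_node=False))   # one wait set per level
print(demo(per_node=True))    # one wait set per node
```

With one wait set per level, unlocking H1 wakes all three processes and two of them are
immediately descheduled again, but only one wait set is needed. With one wait set per
node there are no unnecessary wake-ups, at the price of three separate wait sets.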

Finally, it is once again admitted that the results presented here are derived
from a rather limited selection of knowledge-source processes, the coding style of
which may be affected by the various efficiencies and inefficiencies of the particular
implementation of the HSII system organization. In particular, since the HSII
speech-understanding system is under constant development, various code sections involving
the system operations have been subject to extensive optimization attempts, while other
sections have not yet had the benefit of such optimization. Additionally, the results are
biased by the task domain (viz., speech understanding) and the data structure chosen to
represent the dynamic solution state of the task. However, it is hoped that the system
organization (including the data base design) is of sufficiently general character that
these particular results at least give a feeling for the results that might be expected
using a different set of knowledge-source processes to solve the same or different
problems.


SUMMARY  AND  CONCLUSIONS 


This  paper  has  presented  a design  for  the  organization  of  knowledge-based 
AI  problem-solving  strategies  which  is  felt  to  be  particularly  applicable  for 
implementation  on  closely-coupled  multiprocessor  computer  systems.  The  method  of 
design  is  a result  of  formulating  the  problem-solving  organization  in  terms  of  the 

^ Note that the number of primitive lock operations for preconditions is equal to the
number of blackboard accesses (from the precondition process averages of Table 4):
preconditions do not usually need a long-lasting locked environment (since they do
not modify the blackboard except to place tags into it), thus each access is
individually protected by the HSII operating system (via temporary-locking), rather
than having the precondition perform an explicit LOCK! operation before each access.
hypothesize-and-test paradigm for heuristic search, where the various hypothesizers
and testers are represented as knowledge sources applicable to the task domain of the
problem being solved. A knowledge source may be described as an agent that embodies
the knowledge of a particular aspect of the problem domain and is useful in solving a
problem from that domain by performing actions based on its knowledge so as to
further the progress of the overall problem solution. The hypothesize-and-test
paradigm provides the conceptual means of coordinating these various knowledge
source activities by suggesting that it is the function of some knowledge sources to
create hypotheses representing a possible (perhaps partial) solution state for the given
problem. Hypotheses are created in a global data base and are available for inspection
by all knowledge sources. It is the responsibility of other knowledge sources to
evaluate these hypotheses in light of their own knowledge of the task domain, and
either accept or reject the hypotheses, or propose their own alternative hypotheses
(by either modifying the existing hypotheses or creating entirely new ones).

The Hearsay II speech-understanding system (HSII), which has been
developed at Carnegie-Mellon University using the techniques for system organization
described here, has provided a context for evaluating this system architecture. The
HSII organization provides the facilities necessary for knowledge-source cooperation
through the hypothesize-and-test paradigm to be carried out in a highly asynchronous
and data-directed manner, where knowledge sources are specified as independent
processing entities capable of parallel execution; the activities of any given collection of
such knowledge sources are coordinated by the hypothesize-and-test paradigm through
the use of a shared global data base called the blackboard.

In specifying the blackboard as the primary means of interprocess
communication, particular attention has been paid to resolving the data access
synchronization problems and data integrity issues arising from the asynchronous data
access patterns possible from the various independently executing parallel knowledge-
source processes. A non-preemptive data access allocation scheme was devised in
which the units of allocation could be linearly ordered and hence allocated according to
that ordering so as to avoid data deadlocks. The particular units of data allocation
(locking) were chosen as being either blackboard nodes (node-locking) or abstract
regions in the blackboard (region-locking). Blackboard nodes also represent the units
of data creation within the blackboard. The region-locking mechanism views the
potential blackboard as an abstract data space in which access rights to abstract
regions could be granted without regard to the actual data content of these regions.
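
The deadlock-avoidance rule summarized above (allocate lockable units according to a
fixed linear ordering) can be illustrated with a minimal sketch. The class
OrderedLockTable, the unit names, and the choice of lexicographic order on unit names
as the linear ordering are all assumptions made for illustration; they are not the
HSII locking primitives. Two workers request the same pair of units in opposite orders,
which would risk deadlock if each acquired locks in its own request order.

```python
import threading


class OrderedLockTable:
    """Hypothetical lock table that always acquires units in sorted order."""

    def __init__(self, unit_names):
        self._locks = {name: threading.Lock() for name in unit_names}

    def acquire_all(self, units):
        ordered = sorted(set(units))        # the assumed linear ordering
        for name in ordered:
            self._locks[name].acquire()     # non-preemptive: wait, never steal
        return ordered

    def release_all(self, ordered):
        for name in reversed(ordered):
            self._locks[name].release()


def run_demo(iterations=200):
    table = OrderedLockTable(["PHON:3", "PSEG:3", "PSEG:4"])
    finished = []

    def worker(name, wanted):
        for _ in range(iterations):
            held = table.acquire_all(wanted)
            table.release_all(held)
        finished.append(name)

    threads = [
        threading.Thread(target=worker, args=("KS-A", ["PHON:3", "PSEG:4"]), daemon=True),
        threading.Thread(target=worker, args=("KS-B", ["PSEG:4", "PHON:3"]), daemon=True),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=10)
    return sorted(finished)


print(run_demo())
```

Because both workers end up acquiring PHON:3 before PSEG:4, neither can hold one unit
while waiting for a unit held by the other, so both always run to completion.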

Another area of concern in the use of a shared blackboard-like data
facility is the set of assumptions made by the various executing knowledge sources
concerning data integrity and localized data contexts. Since the blackboard is
intended to represent only the most current global status of the problem solution state,
mechanisms were introduced to allow individual knowledge sources to retain recent
histories of modifications made to the dynamic blackboard structure in the form of local
contexts. Knowledge sources are also permitted to mark (tag) arbitrary fields (or nodes
or regions) of the blackboard itself (without requiring continuing access rights to the
field being tagged) and thereby monitor (in a non-interfering way) those locations for
subsequent changes; the knowledge source will then be sent messages should any
modifications be performed upon a tagged field. Local contexts provide knowledge
sources with the ability to create a local data state which reflects the net effects of
data events which have occurred in the data base since the time of the knowledge
source’s activation. Combined with the blackboard data tagging capabilities, local
contexts also provide a means by which knowledge sources can execute quite
independently of any other concurrently executing knowledge sources (and without
interfering with the execution progress of any of these processes).
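
One possible reading of a local context is sketched below. It is an assumption-laden
illustration, not the HSII mechanism: the class LocalContext, its methods, and the
message format (field, old value, new value) are hypothetical. The context accumulates
the modification messages a knowledge source receives for its tagged fields and reduces
them to their net effect since activation.

```python
class LocalContext:
    """Hypothetical record of changes to tagged fields since activation."""

    def __init__(self):
        self._first_old = {}   # field -> value the field had at activation
        self._last_new = {}    # field -> most recent value written

    def record(self, field, old, new):
        """Handle one tag message describing a modification to a field."""
        self._first_old.setdefault(field, old)
        self._last_new[field] = new

    def net_effects(self):
        """Return {field: (value at activation, current value)} for real changes."""
        return {
            field: (old, self._last_new[field])
            for field, old in self._first_old.items()
            if old != self._last_new[field]
        }


context = LocalContext()
context.record("H7.rating", 40, 55)
context.record("H7.rating", 55, 60)
context.record("H9.rating", 10, 20)
context.record("H9.rating", 20, 10)   # change undone by a later event
print(context.net_effects())
```

Only the field whose value actually differs from its value at activation appears in the
result, so the knowledge source can reason about the net change without holding any
lock on the blackboard.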

In an attempt to improve the problem-solving efficiency of a multiprocessor
implementation of the system by increasing the amount of potential parallelism from
knowledge source activity, the logical functions of precondition evaluation and
knowledge source execution are split into separate processing entities (called, of course,
precondition and knowledge-source processes). A precondition process is responsible for
monitoring and accumulating blackboard data events which might be of interest to the
knowledge source associated with the precondition; and when the appropriate data
conditions for the activation of the knowledge source exist in the blackboard, the
precondition will instantiate a knowledge-source process based on its associated
knowledge source, giving to the new process the data context in which the precondition
was satisfied.

The process activity of HSII is intended to be very data-directed in nature,
basing the decisions as to whether a knowledge source action can be performed on the
dynamic data state represented in the blackboard data base. It is the responsibility of a
precondition to test this data state for conditions which would warrant the instantiation
of the knowledge source associated with the precondition. The activation of the
precondition itself is also data-directed, being based on monitoring for the more
primitive blackboard modification operations which knowledge-source processes may
invoke to effect the results of their computation. This blackboard monitoring is
implemented by having the various blackboard modification operators be responsible for
the activation of preconditions which are monitoring for data events being caused by
the modification operation.
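
The data-directed activation chain described above (modification operator, then
precondition, then knowledge-source instantiation) can be sketched as follows. The
sketch is a simplification under stated assumptions: all class, function, and event
names are hypothetical, and activation is modeled as a synchronous call, whereas in
HSII preconditions and knowledge sources run as separate processes.

```python
class Precondition:
    """Hypothetical precondition: accumulates events, then tests a pattern."""

    def __init__(self, name, event_types, pattern, knowledge_source):
        self.name = name
        self.event_types = set(event_types)    # events this precondition monitors
        self.pattern = pattern                 # predicate over accumulated events
        self.knowledge_source = knowledge_source
        self.events = []

    def activate(self, event):
        """Called by a modification operator; may instantiate the KS."""
        self.events.append(event)
        if not self.pattern(self.events):
            return None                        # pattern not yet satisfied
        context, self.events = self.events, []
        return self.knowledge_source(context)  # new KS instance gets the context


class Blackboard:
    def __init__(self):
        self.nodes = {}
        self.monitors = []
        self.instantiated = []

    def create_node(self, name, level):
        """Modification operator: updates data, then activates monitors."""
        self.nodes[name] = level
        event = ("create", level, name)
        for pre in self.monitors:
            if event[0] in pre.event_types:
                result = pre.activate(event)
                if result is not None:
                    self.instantiated.append(result)


def two_segments(events):
    """Pattern: at least two node creations at the PSEG level."""
    return sum(1 for _, level, _ in events if level == "PSEG") >= 2


def phone_ks(context):
    """Stand-in for a knowledge-source process; records its data context."""
    return ("PHONE-KS", [name for _, _, name in context])


bb = Blackboard()
bb.monitors.append(Precondition("PRE-PHONE", ["create"], two_segments, phone_ks))
bb.create_node("H1", "PSEG")
bb.create_node("H2", "PSEG")
print(bb.instantiated)
```

The first node creation activates the precondition but leaves its pattern unsatisfied;
the second completes the pattern, and the knowledge source is instantiated with the
accumulated events as its data context.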

In  order  to  indicate  the  nature  of  the  performance  of  the  HSII  organization 
when  run  in  a closely-coupled  multiprocessor  environment,  a simulation  system  was 
embedded  into  the  multiprocess  implementation  of  HSII  on  the  DECsystem-10.  While  the 
results  of  the  simulation  are  admittedly  based  on  a small  (but  computationally  expensive) 
set  of  sample  points,  they  have  generally  indicated  the  applicability  of  this  system 
organization to such a hardware architecture. Given the knowledge-based
decomposition  of  a problem-solving  organization  as  prescribed  by  the  HSII  structure, 
effective  parallelism  factors  of  four  to  six  were  realized  even  with  a relatively  small  set 
of  precondition  and  knowledge-source  processes,  with  indications  that  up  to  twelve 
processors  could  be  totally  utilized,  given  appropriate  usage  (or  structuring)  of  the  data 
access synchronization mechanisms. Experiments thus far have indicated that careful
use  of  the  locking  structure  is  required  in  order  to  approach  the  optimal  utilization  of 
any  given  processor  configuration  (unless  there  exist  so  many  ready  processes  that  the 
number  of  suspended  processes  does  not  matter  much,  as  is  the  case  in  configurations 
of  four  or  fewer  processors).  An  extended  use  of  non-interfering  tagging  seems  to  be 
indicated,  along  with  a reduction  in  the  use  of  region-locking  (perhaps  substituting 
region-examining  or  node-locking  wherever  possible).  Measurements  were  also  made  of 
various  system  level  primitive  operations  which  are  required  in  order  to  implement  the 
data-directed  multiprocess  structure  of  HSII.  While  all  these  results  are  of  a 
preliminary  nature  (and  hence  are  subject  to  variation  as  various  components  of  the 
given  implementation  are  improved  in  their  relative  efficiencies),  they  seem  to  indicate 
that the HSII organization is indeed applicable for efficient use in a closely-coupled
multiprocessor environment.


ABSTRACT


The Hearsay II speech understanding system under development at Carnegie-Mellon
University is a complex, distributed-logic processing system. Processing in the
system is effected by independent, data-directed knowledge source processes which
examine and alter values in a global data base representing hypothesized phones,
phonemes, syllables, words, and phrases, as well as the hypothetical temporal and
logical relationships among them. The question of how to schedule the numerous
potential activities of the knowledge sources so as to understand the utterance in
minimal time is called the "focus of attention problem". Near-optimal focusing is
especially important in a speech understanding system because of the very large
solution space that potentially needs to be searched. Using the concepts of stimulus
and response frames of scheduled knowledge source instantiations, competition among
alternative responses, goals, and the desirability of a knowledge source instantiation, a
general attentional control mechanism is developed. This general focusing mechanism
facilitates the experimental evaluation of a variety of specific attentional control
policies (such as best-first, bottom-up, and top-down search heuristics) and allows the
modular addition of specialized heuristics for the speech understanding task.


1 This research was supported in part by the Defense Advanced Research Projects
Agency under contract no. F44620-73-C-0074 and monitored by the Air Force
Office of Scientific Research.


INTRODUCTION


The Hearsay II (HSII) speech understanding system (Lesser, et al., 1974; Erman &
Lesser, 1975) is a complex, distributed-logic processing system. Inputs to the system
are temporal sequences of sets of acoustic segments and associated hypothesized
labels. Diverse sorts of speech understanding knowledge are encoded in several
(currently) independent knowledge source modules (KSs), which include one or more
KSs specific to each of the following knowledge domains: acoustic-phonetic mappings,
phone expectation-realization relationships, syllable recognition, word hypothesization,
and syntax and semantics. The state of processing at any point in time is represented
by a global data base (the blackboard) which holds in an integrated manner all of the
current hypothesized elements, including alternative guesses, for the various
information levels of interpretation (e.g., segmental, phonetic, phonemic, syllabic,
and phrasal). In addition, any inferred logical or confirmatory relationships among
various hypotheses are represented on the blackboard by weighted and directed links
between associated hypotheses. The weight and direction of a link reflect the degree
to which the hypothesis at the tail of the link implies (supports or confirms) that at the
head. The blackboard may be viewed as a two-dimensional problem space, where the
time and information level of a blackboard hypothesis serve as its coordinates. Such a
view permits consideration of specific "areas" of the problem space and enables us to
speak meaningfully of hypotheses in the "vicinity" of a specific data pattern.
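
As a rough illustration of the two-dimensional view, the sketch below stores hypotheses with a level and a time interval, attaches weighted directed links from tail to head, and answers a simple "vicinity" query. The names `Hypothesis`, `link`, and `vicinity`, the ordering of `LEVELS` (including the `word` level, which the text mentions elsewhere), and the `level_slack` parameter are assumptions made for the example, not part of the HSII design.

```python
from dataclasses import dataclass, field

# Assumed ordering of information levels, lowest first.
LEVELS = ["segmental", "phonetic", "phonemic", "syllabic", "word", "phrasal"]


@dataclass
class Hypothesis:
    name: str
    level: str
    begin: int          # start time of the interpretation
    end: int            # end time of the interpretation
    validity: int = 0   # rating in the range -100..+100
    supports: list = field(default_factory=list)  # outgoing links: (weight, head)


def link(tail, head, weight):
    """Record that `tail` implies (supports or confirms) `head` with the given weight."""
    tail.supports.append((weight, head))


def vicinity(hyps, level, begin, end, level_slack=1):
    """Hypotheses overlapping [begin, end) and within `level_slack` levels of `level`."""
    target = LEVELS.index(level)
    return [
        h for h in hyps
        if abs(LEVELS.index(h.level) - target) <= level_slack
        and h.begin < end and h.end > begin
    ]


p = Hypothesis("AE", "phonetic", 10, 20, validity=60)
s = Hypothesis("CAT", "syllabic", 5, 30, validity=40)
far = Hypothesis("DOG", "syllabic", 50, 70)
link(p, s, 0.8)
near = vicinity([p, s, far], "phonemic", 0, 25)
```

Here `near` contains the two hypotheses that lie close to the phonemic level in the first 25 time units, and excludes the one located later in the utterance.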

Processing in the system consists of additions, alterations, or deletions made to
data on the blackboard by the various KSs. Each KS is data-directed, i.e., it monitors
the blackboard for arrival of data matching its precondition pattern, a particular
pattern of hypotheses and links and specific values of their attributes. Whenever its
precondition is matched, the KS is invoked to operate separately on each satisfying
data pattern. Finally, when the KS is executed, its (arbitrarily complex) logic is
evaluated to determine how to modify the data base in the vicinity of the precondition
pattern that triggered the invocation. The data pattern matching the precondition of a
KS will be denoted as the stimulus frame (SF) of the invocation, and the changes it
makes to the data base as its response frame (RF). Each KS may be schematized as a
production rule of the form [precondition -> response]. Each instantiation is then
schematized [SF -> RF], reflecting the fact that the RF data pattern is produced in
response to the determination that the SF matches the rule's precondition. Because of
the complexity of knowledge source processing, a precise definition of the RF cannot
be directly calculated from the stimulus frame without the actual execution of the
knowledge source. However, an abstraction of the RF which specifies the type of
changes that may be made (e.g., the addition of a new hypothesis or new link, the
modification of a hypothesis' validity, etc.) and the general vicinity of the changes can
be easily calculated directly from the SF. It is this abstraction of the RF which will be
used in further discussions.
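
The [precondition -> response] schema and the distinction between the full RF and its cheaply computed abstraction can be sketched as follows. This is only an illustration under assumed names: `KnowledgeSource`, `invoke`, and the example `word_ks` are hypothetical, and frames are represented as plain dictionaries.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class KnowledgeSource:
    name: str
    precondition: Callable[[dict], bool]        # tests a candidate stimulus frame
    estimate_response: Callable[[dict], dict]   # cheap abstraction of the RF
    action: Callable[[dict], dict]              # full, possibly expensive, KS logic


def invoke(ks, candidate_frames):
    """Create one pending invocation [SF -> abstract RF] per satisfying data pattern."""
    return [
        {"ks": ks.name, "sf": sf, "rf_estimate": ks.estimate_response(sf)}
        for sf in candidate_frames
        if ks.precondition(sf)
    ]


word_ks = KnowledgeSource(
    name="word_hypothesizer",
    precondition=lambda sf: sf["level"] == "syllabic",
    # The abstraction states only the kind of change and its vicinity.
    estimate_response=lambda sf: {
        "kind": "new_hypothesis", "level": "word",
        "begin": sf["begin"], "end": sf["end"],
    },
    # The real response would require running the KS; a placeholder stands in here.
    action=lambda sf: {"level": "word", "begin": sf["begin"], "end": sf["end"]},
)

frames = [
    {"level": "syllabic", "begin": 0, "end": 12},
    {"level": "phonetic", "begin": 0, "end": 4},
]
pending = invoke(word_ks, frames)
```

Only the abstraction is computed when the invocation is created; `action` is deferred until the scheduler decides to execute the invocation.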

As is well known in speech understanding research, each KS is imperfect. At
any level of analysis, a very large number of errors may be introduced, including
misclassifications, failures to recognize, and inappropriate "don't care" responses to
what is actually a significant portion of the utterance. The common approach in speech
understanding research is to construct systems which can recognize utterances in
spite of such errors by evaluating many weakly supported alternative hypothesized
interpretations of the speech simultaneously. A practical consequence of this parallel
evaluation of numerous alternatives is that, at any point in time, a great number of KS
applications are warranted by the existence of hypothesized interpretations matching
the various KS preconditions. One object of attentional control is to schedule the
numerous potential activities of the KSs to prevent the intractable combinatorial
explosion which would inevitably result from an unconstrained application of KSs.
More specifically, the focus of attention problem is defined to be that of developing a
method for minimizing the total number of KS executions (or total processing time)
necessary to achieve an arbitrarily low rate of error in the semantic interpretation of
utterances.

The standard approach to the focus of attention problem in other speech
systems employing diverse, cooperating sources (Reddy, et al., 1973; Paxton and
Robinson, 1975; Woods, 1974) is based on an explicit control strategy. In these
explicit control strategies, there is a centralized focusing module which carries out two
functions using a built-in set of speech-specific rules: (1) defining an explicit
sequence of calls to a predefined set of knowledge sources and then evaluating their
responses in order to determine the suitability of a hypothesized phrase (partial parse
of the utterance); and (2) deciding which of many alternative partial parses of the
utterance should be further evaluated. This explicit control strategy is inappropriate
in the HSII framework because it destroys the data-directed nature and modularity of
knowledge source activity. In the HSII system, KSs can be easily removed or added,
and their input and output characteristics changed, without affecting other knowledge
in the system. There is also a more fundamental argument against an explicit control
strategy in a problem-solving system that uses a large number of diverse sources of
knowledge: this explicit strategy requires the use of built-in knowledge about the
specific characteristics of knowledge sources. In this case, it seems that the explicit
sequential logic necessary to get the appropriate interactions among the knowledge
sources in all the possible different data patterns will become very difficult to
predetermine and code.

The approach taken in HSII to focus of attention does not use any explicit
(pre-compiled) information about which knowledge sources currently are contained in the
system, nor their processing characteristics; this approach is more implicit (i.e.,
mechanistic, uniform, and data-directed); it relies more on general task-independent
focusing strategies than on speech-specific ones. It should also be noted that, as part
of these more general focusing strategies employed in HSII, a uniform mechanism has
been incorporated which allows a knowledge source to contribute speech-specific
focusing information through modifications to the blackboard. In this way,
speech-specific focusing information can be exploited without destroying the modularity and
the data-directed nature of knowledge source control in the HSII system framework.

The remainder of this paper is divided into four sections. In the next section, a
number of underlying principles for effective focusing and related processing control
mechanisms are described. Subsequently, in the section on "Additional Mechanisms for
Precise Focusing," additional objectives for focusing are discussed and related
mechanisms for their attainment are presented. The section on "Alternative Policies
for Focus of Attention" describes how these techniques permit experimentation with a
variety of attentional control policies, such as purely bottom-up, purely top-down, and
hybrid analyses. Finally, tentative conclusions are discussed in the last section.


FUNDAMENTAL  PRINCIPLES  AND  MECHANISMS 

One can view the focusing problem as a complex resource allocation problem.
For example, consider the expenditure of money on alternative search devices in a
hunt for oil. The alternative explorers and devices, including seismologists, geologists,
drilling teams, and satellite reconnaissance, are the knowledge sources of the task.
Each produces its response data only with significant cost and with a substantial
probability of error, and there are sequencing constraints which require some KSs to
delay their processing until other KSs terminate theirs and then only if particular
findings are obtained. How should one invest in their potential contributions? Five
fundamental principles have been identified for the control of processing in such tasks,
and these are listed below. Each of these principles is used to define a separate
measure for evaluating the importance that should be attached to each KS invocation
that has not yet been executed. These measures, which are associated with each KS
invocation, are not necessarily constant for the lifetime of the invocation but may need
to be dynamically recalculated as the state of the blackboard changes in the general
vicinity of the KS's stimulus and response frames. A function based on these measures is
then used to associate a priority with each KS invocation.

(1) The competition principle: the best of several alternatives should be
performed first. This principle governs how ordering decisions should be made among
several behavioral options which are competitive in the sense that a successful
outcome of one obviates performing another. For example, consider the problem of
determining whether oil exists at site A and suppose that the functions of a geologist
and seismologist are substitutable vis-a-vis this objective. If either the seismologist or
geologist has already performed and positively indicated the presence or absence of
oil, that result obviates employing the other scientist to perform an equivalent
function. In this sense, it can be said that the previous result competes with the
yet-to-be-performed alternative; that is, the former response is at a higher level of
analysis in the same area of the problem space as is the alternative action. However,
if oil on site B can be determined only by seismological techniques, hiring a geologist
for site A does not compete with hiring a seismologist for site B, according to this
principle.

(2) The validity principle: more processing should be given to KSs operating on
more valid data. This principle says that, everything else constant, one KS invocation
should be preferred to another if the former is working on data which is more
credible. In an oil hunt, it would be preferred to employ as a predictor the one
seismologist whose seismological readings were most accurate. Similarly, in the speech
domain, various KSs will be invoked to contribute to the interpretation of specific data
patterns on the blackboard. Each hypothesis in a SF will contain a rating of its validity
derived from the validities and implications of hypotheses linked to it. Thus, this
principle implies that the KSs invoked to work on the most valid SFs are most
preferred. Once these KSs have performed, the hypotheses in their responses will
also be rated for validity and will, in general, derive their validity directly from the
hypotheses in the SF. By preferring KS invocations with the most credible SFs, the
system tends to maximize the validity of its responses.

(3) The significance principle: more processing should be given to KSs whose
RFs are more significant. This principle aims at ensuring that when a variety of
behaviors can be performed, the most important are done first. For example, while
filing a claim on land and drilling are both necessary prerequisites for successful
completion of an oil hunt, at the outset of prospecting the former is the more
important and should be done first. As an example in the speech domain, a situation
might arise where a sequence of phones could be either recognized as a word or
subjected to analysis for coarticulation effects. The first of these two actions is more
important and, on a priori terms, should be performed first. One heuristic in the
speech understanding domain for defining significance is to give preference to KS
invocations which are operating at the highest levels of analysis within any portion of
the utterance (closest to a complete parse interpretation). A more general statement
of this heuristic is that preference should be given to the KS invocation whose RF can
potentially produce a result which is closest in terms of information level to the overall
goal of the problem solver.

(4) The efficiency principle: more processing should be given to KSs which
perform most reliably and inexpensively. Obviously, if one geologist is more reliable
than another and the two charge the same for their services, the former should be
preferred. Conversely, of two equally reliable geologists, one should prefer the less
expensive. Similarly, in the speech domain, many KS applications are more efficient
than others and should be preferred. As an example, a bottom-up word hypothesizer
is found to be more accurate at generating word hypotheses than is the top-down
syntax and semantics KS. Everything else equal, two invocations of these KSs whose
response frames consist of new word hypotheses should be scheduled so that the
bottom-up hypothesizer is executed first.

(5) The goal satisfaction principle: more processing should be given to KSs
whose responses are most likely to satisfy processing goals. The oil hunt managers
might establish a goal of determining the depth of water at site A. This would induce
additional preference for those agents (e.g., the seismologists and drillers) whose
ordinary activities could concomitantly satisfy this additional goal. In the speech
domain, similar circumstances arise: the priority of a KS which can potentially
generate new word hypotheses in a particular time region of the utterance should be
increased. This desire for a specific type of processing is specified in HSII by
establishing a goal on the blackboard which represents the time and level of the
desired hypotheses. KS instantiations whose RFs match the processing specified in the
goal are made more desirable. More generally, KS invocations may be evaluated as
more or less likely to help satisfy each specific goal. The higher the probability that a
KS invocation will contribute to the satisfaction of a goal and the greater the utility of
the goal, the more desirable its execution becomes. Through this mechanism of adding
goals to the blackboard, a knowledge source can dynamically introduce task-specific
focusing rules into the focusing algorithm. Since KS activity is data-directed, this
focusing policy KS would execute only when the data patterns indicating the need for
a specific focus action occur.

The  preceding  five  principles  provide  the  theoretical  foundation  for  our 
attentional control system. A number of sophisticated control mechanisms have been
created  which  provide  the  tools  by  which  these  principles  can  be  converted  into 
operational  focusing  policies.  These  mechanisms  are  discussed  in  the  remainder  of  this 
section. 

In order to evaluate the preferability of one KS invocation vis-a-vis the others,
the five control principles require a number of ordering relationships to hold. In
overview, the major operational principle for focusing is to schedule for earliest
execution the KS invocation which is the most desirable according to the five rules
provided. The focusing mechanism first evaluates the desirability of each KS
invocation as a measure of the degree to which it satisfies the various objectives of
the system and then executes the most desirable first (with an appropriate
generalization for executing several KSs simultaneously in a multiprocessing system).
Thus, the major subproblem in the construction of a focuser is the estimation of a KS
invocation's desirability. How this desirability is computed will now be described.

Each KS invocation is characterized by a number of attributes. Its SF has a
credibility value (between -100 and +100) which estimates the likelihood that the
detected pattern of hypotheses and links is valid and satisfies the KS's precondition
(negative values imply evidence against this possibility). The credibility value of a SF
is determined as a function of the validity ratings on each of the hypotheses in the SF.
As previously indicated, these ratings themselves are determined from the strengths of
implications on links, the original probabilities assigned to each of the acoustic segment
labels provided as input (i.e., the lowest level hypotheses in the blackboard), and the
derived validity ratings of intermediate level hypotheses. In our current
implementation, the credibility of the SF is taken to be the maximum of the validity
ratings of the hypotheses in the SF (ranging from -100 to +100).
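
The rule stated for the current implementation is simple enough to write down directly. The function name `sf_credibility` and the range check are assumptions added for the example; only the use of the maximum comes from the text.

```python
def sf_credibility(validities):
    """Credibility of a stimulus frame: the maximum validity rating among its hypotheses."""
    if not validities:
        raise ValueError("a stimulus frame must contain at least one hypothesis")
    for v in validities:
        if not -100 <= v <= 100:
            raise ValueError(f"validity rating out of range: {v}")
    return max(validities)


mixed = sf_credibility([-20, 35, 70])    # one well-supported hypothesis dominates
negative = sf_credibility([-80, -10])    # all evidence is against the pattern
```

With this rule a frame is as credible as its best-supported hypothesis, so a frame whose ratings are all negative still receives a negative credibility.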

Each KS invocation can be thought of as a transformation of the SF into the RF.
Associated with the KS invocation then is the estimated level(s) (e.g., phonetic, word,
phrasal) of the RF, the estimated validity of the RF hypotheses, and the estimated time
(i.e., location and duration) of any newly created RF hypotheses. Each of these
estimated values contributes to an appraisal of the significance and probable
correctness of the RF which the KS will produce.

The objectives of the validity, significance, and efficiency principles can
be achieved if the desirability of a KS invocation is computed by any increasing
function of the credibility of its SF, the estimated reliability of the KS (to produce
correct RFs of the form it anticipates), and the estimated level, duration, and validity of
RF hypotheses. The objective of the validity principle, to operate on most valid data
first, is accomplished by making desirability an increasing function of the credibility of
the SF. The objective of the significance principle, to perform the most significant
behaviors first, is achieved by making desirability an increasing function of the level
and duration of RF hypotheses. Since hypotheses closest to complete utterance
interpretations will be at the highest level and span the entire duration of the speech,
actions which can produce such hypotheses or support them will be most preferred.
The objective of the efficiency principle, to prefer KSs which perform best, is achieved
by making desirability an increasing function of the KS's reliability (per unit "cost" or
time).
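
The text requires only that desirability be some increasing function of these quantities. The sketch below picks one such function, a weighted sum of normalized terms, purely for illustration; the function name `base_desirability`, the weights, and the normalizations are assumptions and are not taken from the HSII implementation.

```python
def base_desirability(sf_credibility, ks_reliability, cost,
                      rf_level, rf_duration, rf_validity,
                      top_level=6, utterance_duration=100):
    """One possible increasing function of the variables named in the text.

    Ratings in -100..+100 are mapped to 0..1; levels and durations are expressed
    as fractions of the highest level and of the whole utterance.
    """
    credibility_term = (sf_credibility + 100) / 200     # validity principle
    efficiency_term = ks_reliability / cost             # efficiency principle
    level_term = rf_level / top_level                   # significance principle
    duration_term = rf_duration / utterance_duration    # significance principle
    validity_term = (rf_validity + 100) / 200
    return (0.30 * credibility_term
            + 0.20 * efficiency_term
            + 0.20 * level_term
            + 0.15 * duration_term
            + 0.15 * validity_term)


common = dict(ks_reliability=0.8, cost=1.0, rf_level=4, rf_duration=30, rf_validity=10)
low = base_desirability(sf_credibility=20, **common)
high = base_desirability(sf_credibility=60, **common)
cheaper = base_desirability(sf_credibility=20, ks_reliability=0.8, cost=0.5,
                            rf_level=4, rf_duration=30, rf_validity=10)
```

Raising the SF credibility or lowering the cost raises the result, as the validity and efficiency principles require; any other monotone combination would serve equally well.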

To  understand  how  the  other  objectives,  the  preference  of  the  competition 
principle  for  avoiding  computation  of  obviated  behaviors  and  the  goal-directed 
scheduling dictated by the goal satisfaction principle, are achieved in the system, it is
necessary to introduce a number of additional concepts. The mechanisms required to
operationalize  the  desired  effects  of  competition  will  be  considered  first. 

The first objective of the focuser is to ensure that the understanding system
moves quickly to a complete interpretation of the speech and, in particular, avoids
apparently unnecessary computation. Specifically, if any KS invocation is expected to
produce a RF which is in the same time range as an existing, higher level, longer
duration, and more credible hypothesis, its activity is potentially useless. It is
therefore less preferred than the action of a KS which is expected to produce higher
level, more expansive, and more credible interpretations of the utterance than those
that currently exist. Thus, HSII uses a statistic called the state of the blackboard; this
is a single-valued function of each time value, from the beginning of an utterance to its
end. The state S(t) for some point (time) t in the utterance is the maximum of the
values V(h) of all hypotheses which represent interpretations containing the point t.
The value of a hypothesis is an increasing function of its level, duration, and validity.
Thus, the highest possible value for a hypothesis would be that associated with the
hypothesis representing a complete parse of the entire utterance with a validity rating
of +100 (the maximum). To the extent that the utterance is partially parsed in some
interval [t1,t2], the state S(t) will be high in this region. Thus, S(t) provides a single
metric for evaluating the current success of the understanding process over each area
of the utterance. From a more general viewpoint, the metric V(h) indicates how close a
hypothesis h is to the desired overall goal state; and the metric S measures both what
aspect of the overall goal has been solved (e.g., in the case of speech, what time
interval) and how good the solution is (e.g., in the case of speech, the validity of the
hypothesis and how close in terms of information level it is to the sentential phrase).
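
A small sketch of the state statistic follows. The definition of S(t) as a maximum over the hypotheses containing t is from the text; the particular form of V(h), a product of normalized level, duration, and validity, is an assumption chosen so that a complete parse with validity +100 receives the highest value, 1.0. The function names and the dictionary representation are likewise hypothetical.

```python
def value(h, top_level=6, utterance_duration=100):
    """V(h): an assumed product form, nondecreasing in level, duration, and validity."""
    level_term = h["level"] / top_level
    duration_term = (h["end"] - h["begin"]) / utterance_duration
    validity_term = (h["validity"] + 100) / 200
    return level_term * duration_term * validity_term


def state(t, hypotheses, top_level=6, utterance_duration=100):
    """S(t): the maximum V(h) over all hypotheses whose interval contains time t."""
    return max(
        (value(h, top_level, utterance_duration)
         for h in hypotheses if h["begin"] <= t < h["end"]),
        default=0.0,  # nothing yet hypothesized at time t
    )


word = {"level": 5, "begin": 0, "end": 40, "validity": 50}
phone = {"level": 2, "begin": 30, "end": 50, "validity": 100}
full_parse = {"level": 6, "begin": 0, "end": 100, "validity": 100}
blackboard = [word, phone]
```

On this toy blackboard S(t) is about 0.25 wherever the word hypothesis applies, drops to the phone hypothesis's value just after it, and is zero where nothing has been hypothesized.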

It is very easy, using S(t), to decide whether a prospective action is likely to
improve on the current state of understanding. If the estimated value V(h) of a RF
hypothesis h exceeds S(t) anywhere in the corresponding interval, the KS invocation
should be considered very desirable; otherwise it should be inhibited by the existing
more valuable, competitive hypotheses. This, in short, is how the objective of the
competition principle is accomplished. In addition to its dependence upon the variables
already considered, the desirability of a KS invocation is made to be an increasing
function of the ratio of the maximum of the estimated value of the RF hypotheses to
the current state S(t) (where S(t) is taken to be the minimum over the interval
corresponding to the time location of the RF). In this way, preference is given to KS
invocations which are expected to improve the current state of understanding.
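
The ratio described above can be sketched as follows, with S(t) represented as a list sampled at integer times. The function name `competition_ratio` and the small `floor` used to avoid division by zero in unexplored regions are assumptions for the example.

```python
def competition_ratio(rf_values, rf_begin, rf_end, state_samples, floor=1e-6):
    """Ratio of the best estimated RF value to the minimum of S(t) over the RF interval.

    rf_values     -- estimated values V(h) of the hypotheses in the response frame
    state_samples -- S(t) sampled at integer times t = 0, 1, 2, ...
    floor         -- assumed lower bound on S(t), so empty regions do not divide by zero
    """
    best = max(rf_values)
    current = min(state_samples[rf_begin:rf_end])
    return best / max(current, floor)


# S(t) is 0.5 over the first ten time units and 0.25 over the next ten.
S = [0.5] * 10 + [0.25] * 10

improving = competition_ratio([0.375, 0.5], 5, 15, S)   # spans the weaker region
redundant = competition_ratio([0.375, 0.5], 0, 10, S)   # only matches what exists
```

A ratio above 1 indicates that the invocation is expected to raise the state somewhere in its interval; a ratio at or below 1 indicates that existing hypotheses already compete successfully with it.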

One  can  think  of  S(t)  as  defining  a surface  whose  height  reflects  the  degree  of 
problem solution in each area. In this conception, operations which would yield results
below  the  surface  are  undesirable  (unnecessary),  and  those  which  would  raise  the 
surface  are  preferred. 

The last objective to be operationalized is that of the goal satisfaction principle.
In general, a goal may specify that particular types of hypotheses are to be created
(e.g., create word hypotheses between times t0 and t1) or existing hypotheses
modified in desired ways (e.g., attempt to reject the hypothesized word "no" between
t3 and t4 by establishing disconfirming relationships between it and the acoustic data).
Two types of adjustments are made to the desirability ratings of KS invocations based
on their relationships to such goals. The first case arises when there is direct goal
satisfaction, meaning that a KS invocation is a possible candidate for solving a goal
because its RF matches the desired attributes of the goal. In this case, the desirability
of the KS invocation is increased by an amount proportional to the utility of the goal
(the degree to which it is held to be important when it is created).

The second type of effect is the result of indirect goal satisfaction. In this case,
a KS invocation does not directly satisfy a goal but apparently increases the
probability that it will be solved by producing some result which is held to be partially
useful for the achievement of the main goal. Two types of indirect goal satisfying
actions can be identified. First, there is goal reduction: a KS invocation generates
subgoals whose solution(s) will entail satisfaction of the original goal. For example, as
the result of recognizing the sequence "The (gap) dog," the system might establish a
goal for the recognition of an adjective between the two recognized words to replace
the gap in understanding. Subsequently, some KS might establish several disjunctive
subgoals related to this one, such as goals for recognizing the words "shaggy," "cute,"
"sleepy," etc. Because the satisfaction of any one of these would constitute
satisfaction of the original objective, the KS invocation indirectly satisfies the original
goal. Its desirability is less than that of a KS invocation directly satisfying the same
goal, but may be more than that of other KS invocations.

The second type of indirect goal satisfaction occurs when a KS invocation
approaches a goal by producing a RF which is close to the goal but does not quite
satisfy it. For example, in the context of the preceding "adjective" goal, the responses
of knowledge sources which generate and improve phone hypotheses, syllable
hypotheses, and phrasal hypotheses in the area of interest will be more or less
proximate to the desired response, so a general increase in their activity is warranted.
Since each KS is schematized as a
rule of the form [precondition -> response], a means-ends analysis can be performed
to estimate the probability that some KS invocation will produce a response
contributing to the ultimate solution of a goal. The more closely its RF approaches the
desired goal, the higher is the probability that execution of a KS invocation will
contribute to the goal's ultimate satisfaction and the greater the desirability of the KS
invocation.

In  summary,  the  desirability  of  a KS  invocation  is  defined  to  be  an  increasing 
function  of  the  following  variables:  the  estimated  value  of  its  RF  (an  increasing 
function  of  the  reliability  of  the  KS  and  the  estimated  level,  duration,  and  validity 
credibility  of  the  hypotheses  to  be  created  or  supported);  the  ratio  of  the  estimated 
RF  value  to  the  minimum  current  state  in  the  time  region  of  the  RF;  and,  the  probability 
that  the  KS  invocation  will  directly  satisfy  or  indirectly  contribute  to  the  satisfaction  of 
a goal  as  well  as  the  utility  of  the  potentially  satisfied  goal.  Scheduling  KS  invocations 
according  to  their  desirabilities  then  accomplishes  the  objectives  established  by  the 
preceding  five  basic  principles.  However,  there  are  some  inadequacies  of  such  a basic 
attentional  control  mechanism;  these  are  considered  in  the  next  section. 
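
A minimal sketch of such a desirability computation is given below. The text specifies only that desirability increases with each of the listed variables; the weighted-sum form, the weights, the field names, the guard against a zero state value, and the assumption of non-negative goal utilities are illustrative choices, not part of the described system.

```python
from dataclasses import dataclass


@dataclass
class Invocation:
    """A pending KS invocation, summarized by its response frame (RF)."""
    rf_value: float          # estimated value of the RF (hypothetical 0..100 scale)
    min_state: float         # minimum of S(t) over the time region of the RF
    goal_probability: float  # chance of satisfying or contributing to a goal
    goal_utility: float      # utility of that goal (assumed non-negative here)


def desirability(inv, w_value=1.0, w_ratio=1.0, w_goal=1.0):
    """Weighted sum that increases with each of the three variables."""
    ratio = inv.rf_value / max(inv.min_state, 1.0)  # guard against S(t) = 0
    goal_term = inv.goal_probability * inv.goal_utility
    return w_value * inv.rf_value + w_ratio * ratio + w_goal * goal_term


def schedule(pending):
    """Order invocations so that the most desirable one executes first."""
    return sorted(pending, key=desirability, reverse=True)
```

Changing the three weights changes how strongly each variable influences the execution order, which is the kind of experimental manipulation discussed in the following sections.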


ADDITIONAL  MECHANISMS  FOR  PRECISE  FOCUSING 


While the five fundamental principles appear correct and universally
applicable, they are not complex enough to provide precise control in all of the
situations that arise in a complex distributed-logic understanding system. Three
additional issues are now introduced, and the control mechanisms currently used to
handle these are discussed. The topics considered include dynamically modifiable
recognition and output generation thresholds on KS logic; an implicit goal state
(approximately the inverse of the current state S(t)) which can be used to determine
the desired balance between depth-first and breadth-first approaches to the
understanding problem; and methods for avoiding "false peaks" or "cognitive fixedness"
in the recognition process.

Nearly all KS behavior can be separated into two components: a pattern
recognition component and an output generation component. For example, a word
hypothesizer may look for patterns of phones (pattern recognition) in order to
produce a new word hypothesis (output generation). Both components operate in
fuzzy, errorful ways. In the pattern recognition component, the KS must accept fuzzy
matches of its templates because that is the nature of speech recognition. Correspondingly,
the word hypotheses it generates are necessarily probabilistic. The probable
correctness of its hypotheses is then reflected by validity ratings or implication
weights on its outputs. Thresholding occurs in such processes in two ways. First, the
degree of fuzziness tolerated in pattern matching is arbitrarily set to some moderate
criterion to prevent an intractably large number of apparent matches. Second, the
strengths of the output responses are measured against some threshold to ensure that
only sufficiently credible responses are produced. The credibility of the response
may, in addition to its dependence upon the credibility of the stimulus frame, also be
dependent upon the type of inference method used to generate a response. For
example, the word recognizer might employ a distance metric for recognition and
classification, in which case the credibility of the output word is a decreasing function
of the distance between the stimulus phones and the phones of the most similar word
template. Responses which are too weak vis-a-vis this second threshold are held in
abeyance rather than being produced or forgotten.
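
The two thresholds can be illustrated with a small sketch of a distance-based word recognizer. The linear credibility function, the 0..100 scale, the maximum distance, and the function names are assumptions made for illustration only.

```python
def credibility(distance, max_distance=10.0):
    """Decreasing function of template distance, scaled to 0..100 (assumed form)."""
    return max(0.0, 100.0 * (1.0 - distance / max_distance))


def respond(candidates, match_threshold, output_threshold):
    """Split candidate words into produced responses and responses held in abeyance.

    candidates: list of (word, distance) pairs from a template matcher.
    match_threshold: largest distance still accepted as a fuzzy match.
    output_threshold: smallest credibility that may be output.
    """
    produced, held = [], []
    for word, distance in candidates:
        if distance > match_threshold:
            continue  # too fuzzy to count as a pattern match at all
        score = credibility(distance)
        if score >= output_threshold:
            produced.append((word, score))
        else:
            held.append((word, score))  # kept for later, not forgotten
    return produced, held
```

Lowering `output_threshold` later would allow the held responses to be released without repeating the pattern match.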

Now the general scheme of the robust overall policy that is employed can be
sketched. At the beginning of an analysis, relatively high thresholds are specified for
pattern matching goodness and output goodness. Processing continues based on the
other scheduling principles until thresholds are changed (discussed below). When a
threshold change occurs, it may be specific to certain levels or time regions of RFs or
to the types of KSs used to produce them. As an example, if all of the utterance were
correctly understood except the first word, we would set very low thresholds for
behavior for all KSs in the beginning portion of the utterance. Our current policy,
specifically, lowers thresholds most in poorly understood areas adjacent to areas which
are well understood. When an arbitrary level of desirability is no longer achieved by
any of the pending KS invocations, the important areas for threshold lowering are
identified by finding valleys next to peaks in the state function S(t). The thresholds in
these areas are lowered in the hope that greater error tolerance there will produce
additional results which can be usefully integrated with the adjacent, more reliable
interpretations previously produced.
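
The valley-next-to-peak rule can be sketched as follows. The discretization of S(t) into a list of regions, the numeric cutoffs for "peak" and "valley", and the fixed lowering step are all hypothetical; the text does not give these values.

```python
def lower_thresholds(state, thresholds, peak=70, valley=30, step=10, floor=0):
    """Lower thresholds in poorly understood regions adjacent to well understood ones.

    state: list of S(t) values, one per time region.
    thresholds: list of current threshold values, one per time region.
    """
    updated = list(thresholds)
    for t, s in enumerate(state):
        if s > valley:
            continue  # region is not a valley
        neighbours = state[max(t - 1, 0):t] + state[t + 1:t + 2]
        if any(n >= peak for n in neighbours):
            updated[t] = max(floor, thresholds[t] - step)
    return updated
```

In a fuller version this function would be called only when no pending KS invocation reaches the required level of desirability.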

Without dynamically modifiable pattern match and output goodness thresholds, a
speech understanding system would necessarily embody numerous parameters whose
values were determined at the outset for all problem tasks. Such a system would
probably be very sensitive to the particular values chosen. Our approach, however,
ensures that each of the KSs can be encouraged to perform more work in any area of
the blackboard by simply lowering two general sorts of control variables. This is seen
as a fundamentally important control principle relating to the controllability of the
generative aspect of KSs per se rather than to their comparative expected responses.

The second additional concept which is utilized in the focuser is that of the
implicit goal state, or I(t). It is only a slight oversimplification to think of I(t) as the
inverse of the current state S(t). To the extent that S(t) is large (representing the fact
that the portion of the utterance adjacent to t has been highly successfully analyzed),
I(t) will be small. A small I(t) value means that there is little to be gained by trying to
improve the understanding around t. Conversely, a large I(t) means that the portion of
the utterance in the neighborhood of t greatly needs additional analysis. As a result,
one might suppose that KSs operating in that region should be conceived as satisfying
an implicit goal of raising the level of understanding (the surface of the current state
S(t)) wherever it is lowest. In fact, the best role for the implicit goal state is probably
as a weak contributor to the desirability of a KS invocation. It remains an empirical
question whether it is better to work in the regions of the highest peaks in
understanding (depth-first) or more evenly throughout the entire utterance
(breadth-first). Although an optimal strategy is not known, it is clear that in computing the
desirability of a KS invocation, the estimated value of the RF and the ratio of the RF
value to the minimum of S(t) in the same region are two contributing factors whose
relative weightings can be experimentally manipulated to achieve exactly that balance
between depth-first and breadth-first which is desired.
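
The effect of the two weightings can be shown with a small sketch. Reading "inverse" as `scale - S(t)`, representing the utterance as a list of regions, and the names `w_depth` and `w_breadth` are assumptions made for illustration.

```python
def implicit_goal(state, scale=100):
    """I(t) as the inverse of S(t); 'inverse' is read here as scale - S(t)."""
    return [scale - s for s in state]


def pick_region(state, rf_values, w_depth, w_breadth):
    """Return the index of the region whose pending RF is most desirable.

    w_depth rewards a high RF value; w_breadth rewards a high ratio of RF value
    to the current state, which favours poorly understood regions.
    """
    def score(t):
        ratio = rf_values[t] / max(state[t], 1)  # guard against S(t) = 0
        return w_depth * rf_values[t] + w_breadth * ratio

    return max(range(len(state)), key=score)
```

With the RF value weighted alone, work concentrates where results are already strongest; with the ratio weighted alone, work shifts to the region where S(t) is lowest relative to what the RF promises.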

As is well known in problem solving and search paradigms, there is a constant
danger of getting trapped on "false peaks," as when one bases actions on the apparent
correctness of highly rated but ultimately incorrect interpretations. A number of the
preceding focusing principles have been formulated to ensure that processing in the
region of highly valued hypotheses is facilitated at the expense of other potential
actions; a consequence of this paradigm is that the focuser must take precautions to
prevent the "cognitive fixedness" which would be apparent if the focuser failed to
abandon those paths which lead nowhere. This is done in the focuser in a simple
manner. The highest peak in understanding at any point t in the utterance
corresponds to the highest valued hypothesis in that region, and its value is just S(t).
Thus, stagnation of the understanding process in a region can be detected whenever
S(t) fails to increase for a prolonged time. While preference should still be given to
the execution of KS invocations working on the surface of S(t) and promising to
increase its value, the focuser must conclude that other KS invocations should now
become more desirable than they previously seemed, because they at least may
improve the analysis in the stagnant area. This is accomplished by increasing the
implicit goal state I(t) whenever S(t) is stagnant for a specified length of time. As a
result of increasing I(t), KS invocations operating near the surface of S(t) and
previously viewed as marginally desirable become sufficiently desirable to be
executed. If any one of them succeeds in increasing S(t), I(t) is promptly reset to be
the inverse of S(t). However, each time S(t) stagnates for the specified duration, I(t) is
again increased. Thus, false peaks are avoided by actually recognizing the behavioral
characteristics of cognitive fixedness: as long as the degree of its understanding
remains stagnant, the focuser continually increases the desirability of the competing KS
alternatives which previously appeared to be suboptimal in the area of stagnation.
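
The stagnation rule can be sketched for a single time region as follows. Measuring the "specified length of time" in scheduling cycles, the size of the increase, and the reading of "inverse" as `scale - S(t)` are assumptions of this sketch.

```python
class RegionFocus:
    """Tracks S(t) and I(t) for one time region of the utterance."""

    def __init__(self, state, patience=3, boost=10, scale=100):
        self.state = state
        self.scale = scale
        self.patience = patience  # cycles without progress that count as stagnation
        self.boost = boost        # amount added to I(t) on each stagnation
        self.implicit_goal = scale - state
        self.idle_cycles = 0

    def observe(self, new_state):
        """Record S(t) after one scheduling cycle and update I(t)."""
        if new_state > self.state:
            self.state = new_state
            self.implicit_goal = self.scale - new_state  # reset to the inverse
            self.idle_cycles = 0
            return
        self.idle_cycles += 1
        if self.idle_cycles >= self.patience:
            self.implicit_goal += self.boost  # competing invocations gain desirability
            self.idle_cycles = 0
```

Because I(t) keeps growing while S(t) is stagnant, previously marginal invocations in the region eventually become desirable enough to run.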


ALTERNATIVE  POLICIES  FOR  FOCUS  OF  ATTENTION 

To this point, general principles for focusing and mechanisms to achieve the
realization of these principles have been described. However, there still remains a
wide variety of policies which can be superimposed upon these mechanisms in a
manner consistent with them but prescribing a specific global search strategy to be
employed in speech understanding. This flexibility is considered one of the
outstanding virtues of the focuser design since it affords the possibility for empirical
evaluation of alternative focus of attention policies. In this section, a number of these
policies are identified, and it is shown how each of these can be easily effected within
our system. Each policy described would be effected by one or more policy modules,
each a KS-like program which is activated whenever specific conditions of interest are
detected. This will be clarified by the examples below.

Consider the policy which dictates that, whenever possible, understanding is to
proceed bottom-up, from the acoustic segments to the phrasal level. Such a policy
would be effected as follows. At the outset the policy module would set a goal with
infinite positive utility for RFs at the lowest level and a goal with infinite negative
utility for RFs at higher levels. When the system became quiescent, the policy module
would be reinvoked by the system. Its response would be to modify the goals so that
processing at the two lowest levels would be facilitated and all others inhibited. This
process would continue until the highest level was facilitated. At any particular point
in the analysis, processing would be restricted to several of the lowest levels and
would move upward one level at a time as all the potential activity at a lower level was
completed. Similarly, a purely top-down analysis could be controlled in the same
way, substituting "highest" for "lowest", etc.
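
A bottom-up policy module of this kind might be sketched as follows. The class and method names are hypothetical, the level names follow the levels of the Hearsay II data base, and the way the system signals quiescence to the module is assumed.

```python
import math

# Levels of the data base, lowest first.
LEVELS = ["segment", "phone", "phoneme", "syllable", "word", "phrase"]


class BottomUpPolicy:
    """Facilitates one more level each time the system becomes quiescent."""

    def __init__(self, levels=LEVELS):
        self.levels = levels
        self.open_levels = 1  # only the lowest level is facilitated at the outset

    def goal_utilities(self):
        """Goal utility for RFs at each level: +inf if facilitated, -inf if inhibited."""
        return {
            level: math.inf if i < self.open_levels else -math.inf
            for i, level in enumerate(self.levels)
        }

    def on_quiescence(self):
        """Reinvoked by the system when it becomes quiescent."""
        if self.open_levels < len(self.levels):
            self.open_levels += 1
        return self.goal_utilities()
```

A top-down module would be the same sketch with the level list reversed.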

Under  ordinary  circumstances,  using  only  the  mechanisms  detailed  in  the 
previous  sections,  a hybrid  analysis  will  occur.  While  there  is  increased  desirability 
associated  with  RFs  at  the  highest  levels,  it  is  to  be  expected  that  sometimes  there  will 
be  areas  of  the  utterance  where  all  desirable  KS  invocations  will  be  at  low  levels  while 
in  other  areas  they  will  be  primarily  at  higher  levels. 

A left-to-right analysis can be accomplished using goals in the same way as for
the purely bottom-up or top-down methods. Here, every time quiescence occurs, the
processing from the beginning of the utterance to a point further along in time is
facilitated. This would continue until the whole utterance was facilitated by a goal.
Right-to-left, obviously, is similarly controlled. Note too that "more or less"
left-to-right search can be accomplished by specifying less than infinite goal utilities and by
defining "quiescence" to mean that the desirabilities of all KS invocations are below
some policy threshold for minimally acceptable desirability.
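
The "more or less" left-to-right variant can be sketched with finite utilities and a threshold-based definition of quiescence. The function names, the time units, the step size, and the utility value are hypothetical.

```python
def is_quiescent(desirabilities, policy_threshold):
    """True when every pending KS invocation falls below the policy threshold."""
    return all(d < policy_threshold for d in desirabilities)


def advance_frontier(frontier, desirabilities, policy_threshold, step, end):
    """Extend the facilitated region [0, frontier] when the system is quiescent."""
    if is_quiescent(desirabilities, policy_threshold):
        return min(end, frontier + step)
    return frontier


def region_utility(t, frontier, utility=50.0):
    """Finite goal utility: positive inside the facilitated region, negative outside."""
    return utility if t <= frontier else -utility
```

Because the utilities are finite, a sufficiently promising invocation beyond the frontier can still outrank a weak one inside it.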

Perhaps one of the most important types of empirical comparisons to be studied
is the breadth vs. depth-first alternatives. Breadth-first is, theoretically speaking,
advantageous when KSs are capable of looking at broad contexts and optimizing their
outputs on the basis of more information than is used, for example, by simple
grammatical rewriting rules. Similarly, if KSs are capable of appreciating the extent to
which various hypotheses are partially supported by disparate but cooperative data
scattered about the blackboard, a breadth-first approach should exhibit some
"intelligence". Alternatively, a depth-first approach is desirable whenever KSs make
few errors. For example, if word recognition becomes very good, then it should be
possible to rely upon the words and upon the inferences (e.g., other predicted words)
which are derived from them. This reduction in the necessary parallelism of
hypothesization makes depth-first a reasonable strategy. In the interim, however, it is
apparent that there may be enormous differences in the overall system performance
under these different control policies. It is hoped that in the near future empirical
data on the relative utility of these different strategies can be obtained. Moreover, if
the relative effectiveness of these different control strategies can be associated with
formal properties of a problem's structure and complexity, it may be reasonable to
anticipate that such empirical observations will be helpful in evaluating the formal
complexity of the speech understanding problem.


In summary, it is suggested that the principles and mechanisms described in the
preceding sections provide a parameterized framework for the elaboration of
numerous alternative "macroscopic" policies for attentional control in the speech
understanding problem. Each of the typical sorts of heuristic problem solving policies
can be realized by simple policy modules which manipulate goal utilities and respond to
quiescence in policy-specific ways.


SUMMARY 


By schematizing knowledge sources as [precondition -> response] rules, each
potential behavior of the Hearsay II system is viewed as an instantiation of such a
form. These KS instantiations are seen to be [stimulus frame -> response frame]
action descriptions. The desirability of an instantiation is then computable from
several characteristics of the stimulus and response frames. By enumerating the
fundamental principles for attentional control, a desirability measure is produced which
handles most of the problems in focusing. Several additional objectives make
elaboration of this simple strategy desirable. In order to accomplish more precise
overall control, computations are made of the current state of the analysis, the implicit
goal state of the system, and the relative degree of goal satisfaction of each KS
invocation. Once the desirability of each KS invocation is computed, the execution of
the most desirable first serves to accomplish an apparently optimal allocation of
computing resources. In addition, our framework provides an excellent environment in
which to explore empirically the utility of many global focusing strategies. Each of
these can be expressed in terms of particular weightings of the contributions of
various terms to the desirability of a KS invocation or by simple modules which create,
modify, and monitor goals which control the direction of analysis. The relatively small
grain size of knowledge representation and fine identification of the type and location
of knowledge source contributions apparently afford great advantages in constructing
mechanisms to control a large, distributed, knowledge-based understanding system.


ACKNOWLEDGMENTS 


We would like to acknowledge the help of the following people in the design and
[...] of these ideas in the HSII system: Donald Kosy, Craig Everhart, [...]
McKeown. In addition, Jack Mostow has made numerous contributions to the
conceptualization and development of dynamic thresholds which [...] specific
focusing policies.


REFERENCES 


Erman, L. D., & Lesser, V. R. A multi-level organization for problem solving using many,
diverse, cooperating sources of knowledge. Proc. of the 4th IJCAI, 1975.

Lesser, V. R., Fennell, R. D., Erman, L. D., & Reddy, D. R. Organization of the HEARSAY II
speech understanding system. IEEE Trans. on Acoustics, Speech, and Signal
Processing, 1975, ASSP-23, 11-23.

Paxton, W. H., & Robinson, A. E. System integration and control in a speech
understanding system. A. I. Center, Tech. Note 111, SRI, Menlo Park, Ca., 1975.

Reddy, D. R., Erman, L. D., & Neely, R. B. A model and a system for machine recognition
of speech. IEEE Trans. Audio and Electroacoustics, AU-21, 3, 1973, 229-239.

Woods, W. A. Motivation and overview of BBN SPEECHLIS: an experimental prototype
for speech understanding research. Proc. of IEEE Symposium on Speech
Recognition, Carnegie-Mellon Univ., Pittsburgh, Pa., 1974, 1-10.




HYPOTHESIS  VALIDITY  RATINGS 
IN  THE  HEARSAY  II  SPEECH  UNDERSTANDING  SYSTEM 

Frederick Hayes-Roth, Lee D. Erman, and Victor Lesser

Department  of  Computer  Science 
Carnegie-Mellon  University 
Pittsburgh,  Pennsylvania 
January  26,  1976 


ABSTRACT 


The HEARSAY II speech understanding system under development at Carnegie-Mellon
University is a complex, distributed logic processing system. Processing in the
system is effected by independent knowledge sources -- data-driven procedures
which examine and alter values in a global data base representing hypothesized
segments (lowest level), phones, phonemes, syllables, words, and phrases (highest
level) as well as the hypothetical temporal and logical relationships among them. Each
knowledge source can support a hypothesis unit in two ways: it can assert that the
unit is recognizable in terms of lower level hypotheses or that it is predictable with
some degree of uncertainty on the basis of previously hypothesized units. These two
types of support are called upper and lower implication, respectively. A method for
determining the validity rating of a hypothesis based on an initial validity estimate and
the current validities and implications of supporting hypotheses is presented.

Hypotheses in Hearsay II represent a variety of types of interpretations of the
speech signal, including acoustic segmentation hypotheses, phonetic hypotheses,
phoneme hypotheses, syllable hypotheses, word hypotheses, and phrasal (partial
parse) hypotheses. Each hypothesis is one of two general types, conjunctive or
disjunctive. The conjunctive hypotheses may represent either logical products ("and"
relationships) or temporal sequences of lower level hypotheses. The disjunctive
hypotheses represent logical summations ("or" relationships) among lower level
hypotheses. The degree to which each lower level hypothesis contributes to
(supports) the hypothesized relationship is indicated by two numbers, a weight and an
implication, which are attached to a link from the lower to the upper hypothesis. The
implications range from -100 (maximally disconfirming) to +100 (maximally confirming).
The associated weights range from 0 (least significant) to 100 (most significant).
Regardless of its type, each hypothesis may receive predictive support from other
hypotheses  via  links  with  predictive  implications  ranging  from  -100  (maximally 
counterindicative)  to  +100  (maximally  indicative).  In  addition,  every  hypothesis  may 
receive  "respelling"  support  from  predicted  hypotheses  which  it  may  partially  or  fully 
realize.  For  example,  a prediction  for  the  class  of  words  referred  to  as  IME 
(containing  "me"  and  "us")  gives  respelling  support  to  any  hypothesis  of  the  words 
"me"  or  "us"  which  might  realize  the  expected  word  class  (i.e.,  which  occurs  in  the 
same  time  area  as  the  8ME  hypothesis).  Respelling  implication  relationships  are 
reflected  by  links  from  the  upper  to  the  lower  hypothesis  and  are  associated  with 
implications  ranging  in  value  from  -100  to  +100. 

A system rating policy module (RPOL) is responsible for determining the validity
of each hypothesis as a function of the validities and implications of the hypotheses
which support it. Two distinct validity figures are computed. The upper validity (UV)
is a measure of the extent to which a hypothesis is supported, ultimately, from the
acoustic data. This figure is computed from the hypotheses which support a
hypothesis directly from below. The lower validity (LV) is a measure of the extent to
which a hypothesis is a plausible prediction of other hypotheses. This figure is
computed from the hypotheses which directly predict a hypothesis or respell into it.

The UV of a disjunctive hypothesis is the maximum, over its supporting
hypotheses, of the product of the UV of the supporting hypothesis and the implication
from it to the upper hypothesis, divided by 100. The UV of a conjunctive hypothesis
is a weighted average of the terms (UV * IMPLICATION / 100) computed for each
hypothesis supporting the hypothesis from below. The significance weights associated
with each link determine the relative contribution of each such term in the overall
weighted average. In addition, because computing a weighted average tends to cause
the UV of hypotheses with many supporting hypotheses to move toward 0, the
weighted average of terms is multiplied by a normalizing factor based solely on the
number of terms. This factor is 1.0, 1.1, 1.2, 1.3, etc., for hypotheses with 1, 2, 3, 4, ...
conjunctive supports.
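
The two UV rules can be written out as a short sketch. The function names and the tuple layout of the supports are assumptions; the arithmetic follows the description above, and the handling of a hypothesis with no supports or with zero total weight is left out because the text does not specify it.

```python
def uv_disjunctive(supports):
    """UV of a disjunctive hypothesis.

    supports: non-empty list of (uv, implication) pairs for hypotheses from below.
    """
    return max(uv * implication / 100 for uv, implication in supports)


def uv_conjunctive(supports):
    """UV of a conjunctive hypothesis.

    supports: non-empty list of (uv, implication, weight) triples; the weights
    must not all be zero.
    """
    total_weight = sum(weight for _, _, weight in supports)
    weighted_sum = sum(uv * imp / 100 * weight for uv, imp, weight in supports)
    average = weighted_sum / total_weight
    factor = 1.0 + 0.1 * (len(supports) - 1)  # 1.0, 1.1, 1.2, ... as in the text
    return average * factor
```

For example, two equally weighted supports with UVs of 80 and 40 and implications of 100 average to 60, which the normalizing factor of 1.1 raises to 66.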

The LV of a hypothesis is computed as the sum of the terms (UV * PREDICTIVE
IMPLICATION) for each hypothesis which predicts it plus the sum of the terms (LV *
IMPLICATION / 100) for each hypothesis which gives it respelling implication. This sum
is increased or reduced, if necessary, to keep it within the range -100 to +100.
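
The LV rule can be sketched in the same style. The function names and tuple layout are assumptions. Note also that the text writes the predictive term without the division by 100; this sketch assumes it is scaled like the other terms, which is an interpretation and not something the text confirms.

```python
def clip(value, low=-100.0, high=100.0):
    """Keep a validity within the range -100 to +100."""
    return max(low, min(high, value))


def lower_validity(predictors, respellers):
    """LV of a hypothesis.

    predictors: list of (uv, predictive_implication) pairs for predicting hypotheses.
    respellers: list of (lv, implication) pairs for hypotheses respelling into it.
    """
    # Assumption: the predictive term is divided by 100 like the respelling term.
    predicted = sum(uv * imp / 100 for uv, imp in predictors)
    respelled = sum(lv * imp / 100 for lv, imp in respellers)
    return clip(predicted + respelled)
```

Unlike the conjunctive UV, the LV is a sum and not an average, so several moderate predictions can saturate it at the limit of the range.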

Because the UV is essentially a bottom-up measure of validity and the LV is a
top-down measure of validity, some means of combining the two is needed to
determine an "overall" hypothesis validity rating. RPOL generates such an overall
validity rating by taking 0.9 * UV + 0.5 * LV (restricted to the range -100 to +100).
Each knowledge source in the system (including the policy module(s) responsible for
scheduling KSs [Hayes-Roth & Lesser, 1976]) can then utilize whichever validity
measure is most relevant: UV, LV, or overall validity.
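
As a worked illustration, the overall rating can be computed in one line; only the function name is an assumption.

```python
def overall_validity(uv, lv):
    """Overall rating: 0.9 * UV + 0.5 * LV, restricted to the range -100 to +100."""
    return max(-100.0, min(100.0, 0.9 * uv + 0.5 * lv))
```

With UV = 50 and LV = 20 the overall validity is 45 + 10 = 55. Since the coefficients sum to more than 1, strong agreement between the two measures saturates at the limit: UV = 100 and LV = 100 give 140, which is restricted to 100.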

Like other knowledge sources in Hearsay II, RPOL is data-driven. Whenever the
validities of hypotheses are modified, new links are created, or implications are
changed, RPOL is invoked to recompute whatever validities may have changed. Thus
validity changes propagate automatically to all hypotheses immediately or indirectly
affected. No cycles occur in such propagation because the UV and LV values are
separately computed from bottom-up and top-down supports, respectively. Finally, it
should be noted that hypothesis validity rating is handled in Hearsay II as a system
policy function. Knowledge sources contribute to the process only by creating new
hypotheses and linking them to pre-existing ones by links whose implications and
significance weights completely specify the amount of support which the knowledge
source can provide. This uniform method of representation and evaluation means that
the effects of changes in the plausibility of any hypothesis are automatically computed
by the system, and propagated throughout the data base, without necessitating
knowledge source interventions.
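
The upward propagation of UV changes might be sketched as follows. The dictionary representation of the data base, the function name, and the simplification that every upper hypothesis is disjunctive are assumptions made for brevity; the sketch relies on the links running strictly from lower to upper hypotheses, so that no cycles arise.

```python
from collections import deque


def propagate_uv(uv, links, changed):
    """Recompute the upper validities affected by a change, bottom-up.

    uv: dict mapping a hypothesis name to its UV.
    links: dict mapping an upper hypothesis to [(lower, implication), ...];
           every upper hypothesis is treated as disjunctive in this sketch.
    changed: names of hypotheses whose UV was just modified.
    """
    parents = {}
    for upper, lowers in links.items():
        for lower, _ in lowers:
            parents.setdefault(lower, []).append(upper)

    queue = deque(changed)
    while queue:
        name = queue.popleft()
        for upper in parents.get(name, []):
            new_uv = max(uv[low] * imp / 100 for low, imp in links[upper])
            if new_uv != uv.get(upper):
                uv[upper] = new_uv
                queue.append(upper)  # the change may affect hypotheses above
    return uv
```

Only hypotheses whose value actually changes are pushed further up, so an update that leaves a maximum unchanged stops propagating at that point.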

ABSTRACT

A new method for efficient recognition of general
relational structures is described and compared with existing
methods. Patterns to be recognized are defined by templates
consisting of a set of predicate calculus relations. Productions
are representable by associating actions with templates. A
network for recognizing occurrences of any of the template
patterns in data may be automatically compiled. The compiled
network is economical in the sense that conjunctive products
(subsets) of relations common to several templates are
represented in and computed by the network only once. The
recognition network operates in a bottom-up fashion, in which
all possibilities for pattern matches are evaluated
simultaneously. The distribution of the recognition process
throughout the network means that it can readily be
decomposed into parallel processes for use on a
multi-processor machine. The method is expected to be especially
useful in errorful domains (e.g., vision, speech) where parallel
treatment of alternative hypotheses is desired. The network is
illustrated with an example from the current syntax and
semantics module in the Hearsay II speech understanding
system.

INTRODUCTION 

The work described in this paper was motivated by
certain problems involved in the task of recognizing general
structured patterns and, in particular, the problem of parsing
continuous spoken speech. From the point of view of the
language parser, an essential quality of speech is its errorful
nature. Ambiguities in acoustic segmentation, phonetic
labelling, word hypothesization, and semantic interpretation
necessitate understanding systems which can deal efficiently
with multiple alternative hypotheses about each portion of the
input [11]. The usual methods of dealing with such multiple
hypotheses typically entail an expensive search through a
combinatorial space, since they consider only one hypothesis
for each portion of input at a time, and then exploit contextual
relationships to eliminate certain combinations of adjacent
hypotheses as impossible. The data structure and associated
recognition procedure described in this paper can be thought
of as effectively reversing this process by first exploiting
context -- thereby eliminating all but a few combinations from
consideration -- and then testing contextually related
hypotheses for adjacency. Since the contextual information is
statically embedded in the data structure itself, comparatively
little work needs to be done at recognition time. This work
requires only the computation of a few simple operations
rather than a complex search. Moreover, the method provides
an efficient way to handle the spurious insertions and
deletions characteristic of speech.

TEMPLATE  GRAMMARS 

In this section, we define template grammars for
recognizing relational structures. A template normal form
(TNF) for template grammars is defined. An algorithm
described elsewhere [7] translates a given template grammar
into an equivalent TNF grammar which is economical in that it
maximally exploits repeated subtemplates in the original
grammar. The construction of an automatically compilable
recognition network (ACORN) from a TNF grammar is described
in the next section. The definitions we use are tailored to
natural language understanding, but are immediately
generalizable to other applications (e.g., vision).

A relation $r(x_1, \ldots, x_n)$ is an n-ary predicate
corresponding to some element or pattern in the language.
For example, the relation $\mathit{tell}(x_1, x_2)$ holds if the word tell
occurs in the input utterance beginning at time $x_1$ and ending
at time $x_2$. In general, $x_1$ and $x_2$ are temporal arguments
specifying the time interval containing a recognized occurrence
of the relation, and $x_3, \ldots, x_n$ are additional attributes of the
occurrence. A relation is called primitive if it corresponds to a
primitive element (terminal symbol) of the language,
non-primitive if it corresponds to a pattern of elements
(non-terminal symbol), and top-level if it corresponds to a complete
pattern (sentential form).

A template T is a Boolean combination of relations $r_i$,
$i = 1, \ldots, |T|$, restricted as follows. It must be either a disjunction

\[ r_1(x_1, \ldots, x_n) \lor r_2(x_1, \ldots, x_n) \lor \cdots \lor r_d(x_1, \ldots, x_n), \qquad |T| = d \ge 1, \]

or a conjunction

\[ r_1(x_{11}, \ldots, x_{n_1 1}) \land \cdots \land r_p(x_{1p}, \ldots, x_{n_p p}) \land \lnot r_{p+1}(x_{1,p+1}, \ldots, x_{n_{p+1},p+1}) \land \cdots \land \lnot r_q(x_{1q}, \ldots, x_{n_q q}), \qquad q \ge p \ge 1. \]

In the first case (disjunction), the symbolic arguments
$(x_1, \ldots, x_n)$ are the same for each $r_i$, $i = 1, \ldots, d$. In the second
case, a weaker condition must be satisfied: the relations must
have enough symbolic arguments in common for the template
to be connected; that is, for any partition of the relations
$r_1, \ldots, r_q$ into two non-empty sets A and B, there must exist
relations $r_a(x_{1a}, \ldots, x_{n_a a}) \in A$ and $r_b(x_{1b}, \ldots, x_{n_b b}) \in B$ with

\[ \{x_{1a}, \ldots, x_{n_a a}\} \cap \{x_{1b}, \ldots, x_{n_b b}\} \ne \emptyset. \]
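
The connectedness condition can be checked mechanically. The following Python sketch is illustrative only: the encoding of a relation as a (name, argument-tuple) pair and the function name `is_connected` are assumptions made here, not part of the original formulation. It links two relations whenever they share a symbolic argument and tests whether every relation can be reached from the first, which is equivalent to the partition condition.

```python
def is_connected(relations):
    """Return True if a conjunctive template is connected.

    `relations` is a list of (name, args) pairs, where args is a tuple of
    symbolic argument names. Two relations are linked if they share at
    least one argument; the template is connected if every relation can be
    reached from the first one through such links.
    """
    if not relations:
        return False
    reached = {0}
    frontier = [0]
    while frontier:
        i = frontier.pop()
        for j, (_, args) in enumerate(relations):
            if j not in reached and set(relations[i][1]) & set(args):
                reached.add(j)
                frontier.append(j)
    return len(reached) == len(relations)


# Connected: the two relations share the argument t2.
linked = [("or", ("t1", "t2")), ("TOPIC", ("t2", "t3", "expr"))]
# Not connected: no argument is shared.
loose = [("tell", ("t1", "t2")), ("me", ("t3", "t4"))]
```

Here `is_connected(linked)` holds while `is_connected(loose)` does not, since `tell` and `me` could be split into sets A and B with no common argument.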

A template grammar is a set of rules of the form
[<template> -> <relation>; <action>]. The action optionally
associated with each rule specifies what should be done in the
event that an instance of the template is recognized in the
input and the rule is invoked. Thus a template grammar is
actually a production system of the sort described by Newell
[12].


Table 1. Sample Grammar (G_AP)

1. [ Ford(t_1, t_2) -> TOPIC(t_1, t_2, expr); expr <- "FORD" ]

2. [ Rockefeller(t_1, t_2) ∨ Rocky(t_1, t_2) ->
     TOPIC(t_1, t_2, expr); expr <- "ROCKEFELLER" ]

3. [ Kissinger(t_1, t_2) ->
     TOPIC(t_1, t_2, expr); expr <- "KISSINGER" ]

4. [ or(t_1, t_2) ∧ TOPIC(t_2, t_3, expr) -> TOPIC*(t_1, t_3, expr); ]

5. [ TOPIC(t_1, t_2, expr_1) ∧ TOPIC*(t_2, t_3, expr_2) ->
     TOPIC*(t_1, t_3, expr); expr <- expr_1 ∪ expr_2 ]

6. [ UTTERANCE(t_1, t_4) ∧ about(t_2, t_3) ∧
     TOPIC(t_3, t_4, expr) -> TOPIC*(t_3, t_4, expr); ]

7. [ UTTERANCE(t_1, t_6) ∧ tell(t_1, t_2) ∧ me(t_2, t_3) ∧
     nothing(t_3, t_4) ∧ about(t_4, t_5) ∧ TOPIC*(t_5, t_6, expr) ->
     REJECT(t_1, t_6, expr); SUPPRESS(expr) ]

8. [ UTTERANCE(t_1, t_6) ∧ tell(t_1, t_2) ∧ me(t_2, t_3) ∧
     ¬nothing(t_3, t_4) ∧ about(t_4, t_5) ∧ TOPIC*(t_5, t_6, expr) ->
     REQUEST(t_1, t_6, expr); RETRIEVE(expr) ]

As an example, consider the sample template grammar
G_AP (Table 1), which is part of a much larger grammar for
analyzing spoken queries to a wire-service news retrieval
system [4]. G_AP's top-level relations are REQUEST and REJECT.
An instance of REQUEST is the utterance "Tell me all about
Rocky." An instance of REJECT is the utterance "Tell me
nothing about Ford, Rockefeller, or Kissinger." The primitive
relation UTTERANCE(t_1, t_2), used in rules 6, 7, and 8, simply
signifies that the entire utterance spans the time interval
[t_1, t_2]; this makes the beginning and ending times of the
utterance accessible as arguments to other relations, without
violating the framework of the template grammar. Rule 2
illustrates the use of features and actions. The feature, expr,
of TOPIC is the semantic expression eventually passed to the
actual news retrieval routine. The action of Rule 2 gives expr
the value "ROCKEFELLER." Rule 5 is an example of recursion; it
handles phrases of the form "topic_1, topic_2, ..., topic_{n-1}, or
topic_n." The action of Rule 5 forms a compound semantic
expression from the expressions associated with its individual
constituents. Thus the instance "Ford, Rockefeller, or
Kissinger" of the relation TOPIC* has expr = {"FORD",
"ROCKEFELLER", "KISSINGER"}. Rule 6 shows how context
sensitivity can be embedded in a template grammar. It states
that any instance of TOPIC which occurs at the end of an
utterance, and whose left context is ABOUT, constitutes an
instance of TOPIC*. Rule 8 illustrates the use of negation. It
states that any utterance of the form "Tell me ... about X" is a
request for information about X unless the gap contains
the word "nothing." Thus "Tell me about Ford," "Tell me all
about Ford," and "Tell me everything you know about Ford"
are all instances of REQUEST. This illustrates the capacity of a
template grammar to ignore redundant portions of the input.
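
As a rough illustration of how Rules 1 through 5 compose semantic expressions, the sketch below applies them bottom-up to hand-made word instances. The tuple encoding, the time values, and the function name `derive_topics` are assumptions made for this example; it is a naive fixpoint over the rules, not the recognition procedure developed later in the paper.

```python
WORD_TOPICS = {            # Rules 1-3: word -> semantic expression
    "ford": "FORD",
    "rockefeller": "ROCKEFELLER",
    "rocky": "ROCKEFELLER",
    "kissinger": "KISSINGER",
}


def derive_topics(words):
    """Derive TOPIC* instances from word instances.

    words: list of (word, t1, t2) triples.
    Returns a set of (t1, t2, frozenset_of_exprs) triples.
    """
    topics = {(t1, t2, frozenset([WORD_TOPICS[w]]))
              for w, t1, t2 in words if w in WORD_TOPICS}
    ors = {(t1, t2) for w, t1, t2 in words if w == "or"}
    # Rule 4: or(t1, t2) ^ TOPIC(t2, t3, expr) -> TOPIC*(t1, t3, expr)
    stars = {(o1, t3, e) for (o1, o2) in ors
             for (t2, t3, e) in topics if o2 == t2}
    # Rule 5 (recursive): TOPIC(t1, t2, e1) ^ TOPIC*(t2, t3, e2)
    #                     -> TOPIC*(t1, t3, e1 U e2); repeat until no change.
    changed = True
    while changed:
        changed = False
        for (t1, t2, e1) in topics:
            for (s2, t3, e2) in list(stars):
                new = (t1, t3, e1 | e2)
                if t2 == s2 and new not in stars:
                    stars.add(new)
                    changed = True
    return stars


# "Ford, Rockefeller, or Kissinger" with made-up times (centiseconds).
words = [("ford", 0, 30), ("rockefeller", 30, 80),
         ("or", 80, 90), ("kissinger", 90, 140)]
result = derive_topics(words)
```

With these inputs, `result` contains a TOPIC* instance spanning the whole phrase whose expression set holds all three names, alongside the shorter TOPIC* instances from which it was built.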

A template grammar is in template normal form (TNF) if
the following conditions are satisfied:

(1) The template of each rule has one of the following
types:

    <relation_1> ∨ <relation_2> ∨ ... ∨ <relation_d>, d ≥ 1   (disjunctive type)

    <relation_1> ∧ <relation_2>                               (conjunctive type)

    <relation_1> ∧ ¬<relation_2>                              (negative type)

The relations in a disjunctive template have the same
symbolic arguments; the relations in a conjunctive or negative
template are connected.

(2) Every non-primitive relation appears on the right
side of exactly one rule. Hence we can define the type of a
relation to be the type of its unique defining template; a
primitive relation is simply said to be of primitive type.
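
The two TNF conditions lend themselves to a simple structural check. The sketch below is an assumption-laden illustration: rules are encoded as hypothetical (kind, constituents, head) triples, and the function name `tnf_violations` is invented here. It checks only the shape of each template and the uniqueness of each defining rule; it does not verify the conditions on symbolic arguments (same arguments, connectedness).

```python
from collections import Counter


def tnf_violations(rules, primitives):
    """List structural TNF violations in a grammar.

    rules: list of (kind, constituents, head), kind in {'or', 'and', 'not'},
           constituents a list of relation names, head the defined relation.
    primitives: set of primitive relation names.
    Returns a list of human-readable messages (empty if none were found).
    """
    problems = []
    for kind, constituents, head in rules:
        if kind not in ("or", "and", "not"):
            problems.append(f"{head}: unknown template type {kind!r}")
        elif kind == "or" and len(constituents) < 1:
            problems.append(f"{head}: disjunction needs at least one relation")
        elif kind in ("and", "not") and len(constituents) != 2:
            problems.append(f"{head}: {kind} template needs exactly two relations")
    # Condition (2): each non-primitive relation is defined by exactly one rule.
    heads = Counter(head for _, _, head in rules)
    used = {c for _, cs, _ in rules for c in cs} | set(heads)
    for rel in sorted(used - set(primitives)):
        if heads[rel] != 1:
            problems.append(f"{rel}: defined by {heads[rel]} rules, expected 1")
    return problems


good = tnf_violations(
    [("or", ["Ford"], "TOPIC/1"), ("and", ["tell", "me"], "TELL+ME")],
    {"Ford", "tell", "me"},
)
bad = tnf_violations(
    [("or", ["Ford"], "TOPIC"), ("or", ["Kissinger"], "TOPIC"),
     ("and", ["about", "TOPIC*"], "ABOUT+TOPIC*")],
    {"Ford", "Kissinger", "about"},
)
```

The second grammar fails because TOPIC is defined twice and TOPIC* is never defined, which is the situation the "/k" renaming and the added disjunctive rules of a TNF translation are meant to repair.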

It is clear that any template grammar G can be
translated into an equivalent grammar G* in TNF by means of
adding new relations and rules. The task of the automatic
translator is to do this in such a way as to minimize the
number of new relations added. The algorithm we employ is
described in [7]. The result of applying the algorithm to the
sample grammar of Table 1 is grammar G_AP*, shown in Table 2.
Mnemonic conventions used in G_AP* are these: "+" indicates
concatenation; "-" indicates temporal overlap; parenthetical
phrases indicate temporal contexts; and "/k" distinguishes
different TNF relations arising from occurrences of a single
relation in various different rules of the original grammar.


Table 2. Sample Grammar in TNF (G_AP*)

1*. [ Ford(t_1, t_2) ->
      TOPIC/1(t_1, t_2, expr); expr <- "FORD" ]

2*. [ Rockefeller(t_1, t_2) ∨ Rocky(t_1, t_2) ->
      TOPIC/2(t_1, t_2, expr); expr <- "ROCKEFELLER" ]

3*. [ Kissinger(t_1, t_2) ->
      TOPIC/3(t_1, t_2, expr); expr <- "KISSINGER" ]

4*. [ or(t_1, t_2) ∧ TOPIC(t_2, t_3, expr) ->
      TOPIC*/4(t_1, t_3, expr); ]

5*. [ TOPIC(t_1, t_2, expr_1) ∧ TOPIC*(t_2, t_3, expr_2) ->
      TOPIC*/5(t_1, t_3, expr); expr <- expr_1 ∪ expr_2 ]

6**. [ UTTERANCE(t_1, t_3) ∧ (ABOUT)TOPIC(t_2, t_3, expr) ->
       TOPIC*/6(t_2, t_3, expr); ]

7****. [ TELL+ME-UTTERANCE-ABOUT+TOPIC*(t_2, t_3, t_1, t_4, expr)
         ∧ nothing(t_2, t_3) ->
         REJECT(t_1, t_4); SUPPRESS(expr) ]

8****. [ TELL+ME-UTTERANCE-ABOUT+TOPIC*(t_2, t_3, t_1, t_4, expr)
         ∧ ¬nothing(t_2, t_3) ->
         REQUEST(t_1, t_4); RETRIEVE(expr) ]

9. [ tell(t_1, t_2) ∧ me(t_2, t_3) -> TELL+ME(t_1, t_3); ]

10. [ about(t_1, t_2) ∧ TOPIC*(t_2, t_3) ->
      ABOUT+TOPIC*(t_1, t_3); ]

11. [ TELL+ME(t_1, t_2) ∧ UTTERANCE(t_1, t_3) ->
      TELL+ME-UTTERANCE(t_2, t_3, t_1); ]

12. [ TELL+ME-UTTERANCE(t_2, t_4, t_1)
      ∧ ABOUT+TOPIC*(t_3, t_4, expr) ->
      TELL+ME-UTTERANCE-ABOUT+TOPIC*(t_2, t_3, t_1, t_4, expr); ]

13. [ about(t_1, t_2) ∧ TOPIC(t_2, t_3, expr) ->
      (ABOUT)TOPIC(t_2, t_3, expr); ]

14. [ TOPIC/1(t_1, t_2, expr) ∨
      TOPIC/2(t_1, t_2, expr) ∨ TOPIC/3(t_1, t_2, expr) ->
      TOPIC(t_1, t_2, expr); ]

15. [ TOPIC*/4(t_1, t_2, expr) ∨
      TOPIC*/5(t_1, t_2, expr) ∨ TOPIC*/6(t_1, t_2, expr) ->
      TOPIC*(t_1, t_2, expr); ]


THE  RECOGNITION  NETWORK 

Given a template grammar in TNF, a corresponding
recognition network (ACORN), as first described in [6], is
constructed as follows. For each relation r appearing in the
TNF grammar, there is a unique node, node(r), in the network.
(Hence minimizing the number of relations in the TNF grammar
is equivalent to minimizing the number of nodes in the
network.) For every rule [T -> r; A], an arc is drawn from
node(s_i) to node(r) for each relation s_i in the template T. Each
node(s_i) is said to be a constituent of node(r), and node(r) a
derivative of node(s_i). A node may have zero, one, or more
derivatives. The recognition network for the sample grammar
G_AP, constructed from the TNF grammar G_AP*, is shown in
Figure 1.

Node(r) contains various information: its type (i.e., the
type of relation r), the action A in the rule [T -> r; A],
and the correspondence between the arguments of relation r
and the arguments of its constituent relations s_i. This
correspondence consists of two parts, a set of tests and a
generator. The tests represent any requirements of
agreement between the arguments supplied by the
constituents node(s_i). The generator is a list of the arguments
which are to be supplied in turn to the derivatives of node(r).
The arguments are encoded according to a canonical mapping.
Consider node(TELL+ME). Its constituents are node(TELL), which supplies
arguments t_1, t_2, and node(ME), which supplies arguments
t_3, t_4. Let L be the concatenated argument list (t_1, t_2, t_3, t_4).
Then node(TELL+ME) can specify its arguments by their indices
in L. Thus node(TELL+ME)'s only test is L(2) = L(3), denoted by
"2:3" below node(TELL+ME) in the network. (See Figure 1.)
Similarly, node(TELL+ME)'s generator is the list [L(1), L(4)],
denoted by "(1,4)" above node(TELL+ME) in the network.
Arguments which are not supplied by a node's constituents, but
instead originate at the node itself, are specified by negative
indices. For example, node(TOPIC/2)'s generator is denoted by
"(1,2,-1)"; the -1 specifies the argument expr, which
originates at node(TOPIC/2). The action stored in
node(TOPIC/2) assigns this argument the value "ROCKEFELLER."
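
The index encoding of tests and generators can be made concrete with a small sketch. The function name `apply_node`, its parameter names, and the time values used for TOPIC/2 are assumptions for illustration; only the "2:3", "(1,4)", and "(1,2,-1)" encodings come from the text.

```python
def apply_node(tests, generator, constituent_args, originated=()):
    """Apply a node's tests and generator to its constituents' arguments.

    constituent_args: list of argument tuples, one per constituent; their
        concatenation is the list L of the text.
    tests: list of (i, j) pairs of 1-based indices into L that must agree.
    generator: list of 1-based indices into L; a negative index -k selects
        the k-th argument originating at the node itself (`originated`).
    Returns the generated argument tuple, or None if any test fails.
    """
    L = [a for args in constituent_args for a in args]
    if any(L[i - 1] != L[j - 1] for i, j in tests):
        return None
    return tuple(L[g - 1] if g > 0 else originated[-g - 1] for g in generator)


# node(TELL+ME): test "2:3", generator "(1,4)".
tell_me = apply_node([(2, 3)], [1, 4], [(0, 18), (18, 23)])
# A non-adjacent pair fails the test.
no_match = apply_node([(2, 3)], [1, 4], [(257, 269), (18, 23)])
# node(TOPIC/2): generator "(1,2,-1)"; expr originates at the node.
topic2 = apply_node([], [1, 2, -1], [(30, 80)], originated=("ROCKEFELLER",))
```

Because tests and generators are plain index lists, they can be fixed when the network is compiled, leaving only equality checks and tuple construction for recognition time.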

All of the recognition network components described so
far are static. There is also associated with each node(r) a
dynamic instance list IL. Each instance in the instance list of
node(r) represents a single recognized occurrence
(instantiation) of the relation r in the input utterance. An
instance has several components: a unique identification
number I; the time interval [x_1, x_2] containing the occurrence;
the values x_3, ..., x_n of additional attributes of the occurrence;
and a support set SS containing one or two instance
identification numbers. An instance is denoted I:(x_1, ..., x_n; SS).
During the recognition process, instances are created and
deleted dynamically.

The recognition process is bottom up, as follows.
Initially all instance lists are empty. A lexical analyzer is
invoked and begins to scan for occurrences of primitive
relations in the input utterance. Since the lexical analyzer
receives imperfect, incomplete information from the phonetic
labelling routine, the best it can do is to identify possible
occurrences. When it finds a possible occurrence of a relation
r, it adds a new element to the instance list of node(r)
containing the appropriate information. To understand the
recognition process, imagine each node(r) as having a demon.
The node(r) demon continuously monitors the instance list of
each constituent node(s_i) of node(r). Whenever a new
instance is added to the instance list of node(s_i), the node(r)
demon adds a reference to this new instance to its node(s_i)
add set. Similarly, whenever an existing instance of s_i is
deleted, the node(r) demon saves a copy of it in its node(s_i)
delete set. Add sets and delete sets are referred to
collectively as change sets [9]. The demon then activates
(wakes) node(r) itself by invoking code pointed to by node(r).

When node(r) is activated, it updates its instance list
according to the information in its constituents' instance lists
and change sets. If node(r) can derive (construct) any new
instances from instances of its constituents, it does so, adding
the new instances to its instance list. The support set of each
instance contains the identification numbers of the instances
from which it has been derived. Node(r) deletes from its
instance list any instances supported by (derived from) the
defunct instances listed in its constituents' delete sets. The
exact way in which all this is done depends, of course, on the
type of node(r).

If node(r) is disjunctive, then it has d constituents
node(s_1), ..., node(s_d). For each instance I:(x_1, ..., x_n; SS) in a
node(s_i) add set, node(r) adds a new element
I_new:(z_1, ..., z_k; {I}) to its own instance list, computing z_1, ..., z_k
from the values of x_1, ..., x_n according to the generator stored
in node(r). I_new's support set is {I} because the instance I_new
of r is derived from (supported by, dependent on) the instance
I of r's constituent relation s_i. For each defunct instance I in a
node(s_i) delete set, node(r) deletes all instances
I_old:(z_1, ..., z_k; SS) supported by I, i.e., such that I ∈ SS.
(Actually, for disjunctive r, all instances of r will have support
sets of size one, so I ∈ SS iff SS = {I}. However, for
conjunctive r, |SS| = 2; hence the set notation.)

If node(r) is conjunctive, then it has exactly two
constituents, node(s_1) and node(s_2), with respective instance
lists IL_1 and IL_2, add sets AS_1 and AS_2, and delete sets DS_1
and DS_2. First node(r) deletes any of its instances
I_old:(z_1, ..., z_k; SS) which were derived from instances in DS_1
or DS_2, i.e., those for which SS ∩ (DS_1 ∪ DS_2) ≠ ∅. Then
node(r) looks for new instance pairs I_1:(x_1, ..., x_n; SS_1),
I_2:(y_1, ..., y_m; SS_2) such that (x_1, ..., x_n) matches
(y_1, ..., y_m) according to the tests stored in node(r). For each
such matching pair, node(r) adds a new element
I_new:(z_1, ..., z_k; {I_1, I_2}) to its instance list, using its generator
to compute z_1, ..., z_k from x_1, ..., x_n, y_1, ..., y_m. It is sufficient
to check only those pairs of instances I_1, I_2 of which one or
both are new, or more formally, such that either I_1 ∈ AS_1 and
I_2 ∈ IL_2, or I_1 ∈ IL_1 and I_2 ∈ AS_2. For example, suppose the
input utterance is "Tell me nothing about Rockefeller," and the
lexical analyzer finds an instance I_1:(0, 18; ...) of tell and an
instance I_2:(18, 23; ...) of me. Then the test stored in
node(TELL+ME) becomes 18 = 18, which is true. So
node(TELL+ME) adds a new instance I_new:(0, 23; {I_1, I_2}) to its
instance list to represent the occurrence of "tell me" in the
concatenated time interval [0, 23]. (Time is measured in
centiseconds since the beginning of the utterance.) Now
suppose the lexical analyzer mistakenly identifies the syllable
"fell" in "Rockefeller" as the word "tell," and adds an instance
I_3:(257, 269; ...) to node(tell)'s instance list. This may happen,
for example, if the phonetic labeller correctly identifies the
"F" in "Rockefeller" as an unvoiced consonant but can't tell if
it's an "F," a "T," or a "P." No harm is done, however, since
when node(TELL+ME) matches I_3 against I_2, the test 269 = 18
fails, and no new instance of TELL+ME is derived from I_3. This
example shows how the ACORN automatically weeds out
spurious instances hypothesized by the lexical analyzer on the
basis of incomplete phonetic information.
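
The conjunctive update can be sketched as a single function over instance lists and change sets. Everything about the encoding below is an assumption made for illustration: instances are (id, args, support) triples, the node's tests and generator are passed as the callables `test` and `generate`, and the names `new_instance` and `update_conjunctive` are invented here.

```python
from itertools import count

_ids = count(1)


def new_instance(args, support=()):
    """Create an instance as an (id, args, support_set) triple."""
    return (next(_ids), tuple(args), frozenset(support))


def update_conjunctive(own, il1, il2, as1, as2, ds1, ds2, test, generate):
    """One activation of a conjunctive node.

    own: the node's current instance list.
    il1, il2: constituents' instance lists (newly added instances included).
    as1, as2 / ds1, ds2: the constituents' add sets and delete sets.
    test(x, y) -> bool and generate(x, y) -> args stand in for the node's
    stored tests and generator. Returns the updated instance list.
    """
    dead = {i[0] for i in ds1} | {i[0] for i in ds2}
    kept = [inst for inst in own if not (inst[2] & dead)]
    # Only pairs with at least one new member need to be checked.
    pairs = [(a, b) for a in as1 for b in il2]
    pairs += [(a, b) for a in il1 for b in as2 if a not in as1]
    for a, b in pairs:
        if test(a[1], b[1]):
            kept.append(new_instance(generate(a[1], b[1]), {a[0], b[0]}))
    return kept


def adjacent(x, y):
    return x[1] == y[0]            # the test "2:3"


def span(x, y):
    return (x[0], y[1])            # the generator "(1,4)"


tell1 = new_instance((0, 18))
me1 = new_instance((18, 23))
tell_me = update_conjunctive([], [tell1], [me1], [tell1], [me1], [], [],
                             adjacent, span)
# A spurious "tell" hypothesized inside "Rockefeller" derives nothing.
tell3 = new_instance((257, 269))
tell_me = update_conjunctive(tell_me, [tell1, tell3], [me1], [tell3], [],
                             [], [], adjacent, span)
```

After both activations the TELL+ME instance list holds the single instance over [0, 23], supported by the instances of tell and me; the spurious instance fails the adjacency test and contributes nothing.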

Finally, if node(r) is negative, then it has two
constituents, node(s_1) and node(s_2), where r = s_1 ∧ ¬s_2. Let
IL_1, IL_2, AS_1, AS_2, DS_1, DS_2 be the instance lists, add sets, and
delete sets of node(s_1) and node(s_2). First node(r) deletes any
of its instances I_old:(z_1, ..., z_k; SS) derived from
instances in DS_1, i.e., those for which SS ∩ DS_1 ≠ ∅. Then
node(r) looks for any instance pairs I_1:(x_1, ..., x_n; SS_1) in IL_1
and I_2:(y_1, ..., y_m; SS_2) in AS_2 such that (x_1, ..., x_n) matches
(y_1, ..., y_m) according to the tests stored in node(r). For each
such pair, node(r) deletes all of its instances I_old:(z_1, ..., z_k; SS)
which depended on I_1, i.e., such that I_1 ∈ SS. This is done
since each such I_old, previously an instance of (s_1 ∧ ¬s_2), is
now invalidated by a new instance of s_2. Adding instances to
node(r) is also a bit tricky, and proceeds as follows. First
node(r) constructs the set IS of all instances I_1 ∈ IL_1 which
match some I_2 in DS_2. Then node(r) looks for all instances
I_1:(x_1, ..., x_n; SS_1) in AS_1 ∪ IS which match none of the
instances in IL_2. For each such I_1, node(r) adds a new
instance I_new:(z_1, ..., z_k; {I_1}) to its instance list.

To illustrate this, let us continue with our sample
utterance, "Tell me nothing about Rockefeller." Suppose that at
some point the lexical analyzer has recognized all the words in
the utterance except the word "nothing," and
node(TELL+ME-UTTERANCE-ABOUT+TOPIC*) has
I_4:(23, 41, 0, 274, "ROCKEFELLER"; ...) on its instance list. Since
the instance list of node(nothing) is empty, node(REQUEST) will
have an instance I_5:(0, 274, "ROCKEFELLER"; {I_4}) on its
instance list. Now suppose that the lexical analyzer finally
recognizes the word "nothing," and puts the instance
I_6:(23, 41; ...) on node(nothing)'s instance list. This activates
both of node(nothing)'s derivatives. Node(REJECT) matches I_6
against I_4, tests 23 = 23 and 41 = 41, and accordingly adds a
new instance I_7:(0, 274, "ROCKEFELLER"; {I_4, I_6}) to its
instance list. Node(REQUEST) matches I_6 against I_4, tests
23 = 23 and 41 = 41, and accordingly deletes
I_5:(0, 274, "ROCKEFELLER"; {I_4}) from its instance list. This
example shows how information is accumulated and corrected
dynamically during the ACORN recognition process. It also
illustrates the ACORN's state-saving nature and its sharing of
information between top-level nodes.
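
The negative-node update can be sketched in the same style. As before, the encoding is an assumption made for illustration: instances are (id, args, support) triples, `match` and `generate` stand in for the node's stored tests and generator, and the function name `update_negative` is invented here. The third activation, in which "nothing" is retracted, is a hypothetical continuation that is not part of the example in the text; it exercises the set IS.

```python
from itertools import count


def update_negative(own, il1, il2, as1, as2, ds1, ds2, match, generate, next_id):
    """One activation of a negative node r = s1 ^ ~s2.

    il1, il2 are the constituents' current instance lists (additions
    applied, deleted instances removed); as*/ds* are add and delete sets.
    """
    dead1 = {i[0] for i in ds1}
    own = [o for o in own if not (o[2] & dead1)]
    # A new instance of s2 invalidates instances that relied on its absence.
    blocked = {a[0] for a in il1 for b in as2 if match(a[1], b[1])}
    own = [o for o in own if not (o[2] & blocked)]
    # The set IS: old s1 instances whose matching s2 instance was deleted.
    revived = [a for a in il1 if any(match(a[1], b[1]) for b in ds2)]
    seen = set()
    for a in list(as1) + revived:
        if a[0] in seen or any(match(a[1], b[1]) for b in il2):
            continue
        seen.add(a[0])
        own.append((next_id(), generate(a[1]), frozenset({a[0]})))
    return own


def same_gap(x, y):
    return x[0] == y[0] and x[1] == y[1]      # gap start and end agree


def to_request(x):
    return (x[2], x[3], x[4])                 # (utterance start, end, expr)


# Identification numbers are chosen to mirror the numbering in the text.
i4 = (4, (23, 41, 0, 274, "ROCKEFELLER"), frozenset())
i6 = (6, (23, 41), frozenset())

# 1. "nothing" not yet recognized: REQUEST acquires I5.
after1 = update_negative([], [i4], [], [i4], [], [], [],
                         same_gap, to_request, count(5).__next__)
# 2. "nothing" recognized: I5 is deleted.
after2 = update_negative(after1, [i4], [i6], [], [i6], [], [],
                         same_gap, to_request, count(7).__next__)
# 3. Hypothetical: "nothing" is later retracted, so REQUEST is re-derived.
after3 = update_negative(after2, [i4], [], [], [], [], [i6],
                         same_gap, to_request, count(8).__next__)
```

The first two activations reproduce the creation and deletion of I_5 described above; the third shows why old instances of s_1 must be reconsidered when a matching instance of s_2 disappears.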

Once node(r) has examined its constituents' change sets
and, if appropriate, revised its own instance list, it goes back
to sleep. Meanwhile, the demons sitting on the derivatives of
node(r) have been watching its instance list and, when changes
occur, activate their nodes. This chain reaction continues,
fuelled by new instances generated by the lexical analyzer,
until the lexical analyzer has stopped, all nodes are asleep, and
all change sets are empty.

At this point each instance I:(x_1, ..., x_n; SS) of a
non-primitive node, node(r), may be interpreted as a partial parse
of the interval [x_1, x_2], with relevant syntactic and semantic
features given by x_3, ..., x_n. For example, when the
recognition of our sample utterance terminates, the instance
I:(41, 274, "ROCKEFELLER"; ...) of ABOUT+TOPIC* may be
considered to be a partial parse of the input interval [41, 274]
containing "about Rockefeller." Parse trees can easily be
reconstructed from the information contained in the support
sets. Parses of the entire utterance are given by instances of
top-level nodes. Thus the instance
I_7:(0, 274, "ROCKEFELLER"; {I_4, I_6}) of REJECT constitutes a
total parse of the sample utterance, and supplies the semantic
feature, expr, required by the action SUPPRESS(expr).

RELATIONSHIP TO EXISTING PARSERS
AND PATTERN-MATCHERS

The original motivation which led to the ACORN concept
was the development of a general automatic recognition
system for spoken utterances, visual scenes, and other
structured patterns in which context is a fruitful source of
information. Since the speech understanding ACORN treats an
utterance as a relational structure, it is related both to natural
language parsers and to general pattern-matching mechanisms.

The ACORN's closest relative among natural language
parsers is PARRY [2], a program which simulates a paranoid
individual being interviewed by a psychiatrist. PARRY employs
a large library of stored concept sequence templates which
are compared with segments of typewritten input sentences.
Generalization is achieved by rules which rewrite words as
synonymous concepts, delete unrecognized words and, if
necessary, delete one recognized word at a time until a
template is matched. While the approach underlying PARRY is
very successful with typed input, it appears to be too risky
for spoken input. Unlike the "perfect" input which PARRY
receives, the input to the syntax module of a speech
understanding system such as Hearsay II [9] is highly
imperfect. PARRY can say, with confidence, "this portion of the
input is such-and-such (e.g., the word "oh"), so I'll ignore it;"
Hearsay II can only say "if this portion of the input is "oh," I
can ignore it; but if it's really the word "no," then I'll need it."
An ACORN can be thought of as a non-deterministic version of
a PARRY-like system in which all possible parses are followed
simultaneously in parallel. On the other hand, an ACORN is
capable of recognizing general graph structures and is more
powerful than any context-sensitive language parser (string
recognizer).

Woods' augmented transition network (ATN) [14] is a
mechanism for parsing natural language. It works top down,
uses backtracking, and produces a formal parse of the input
sentence. In contrast, an ACORN works bottom up, does no
backtracking, and extracts only those features of the utterance
which are relevant to the particular application. An ACORN can
be thought of as a state-saving, bottom-up version of an ATN.

Miller [10] has proposed a parser for spoken English
which builds multiple partial parse trees and employs a
complicated and heuristic search to combine them. An ACORN
differs from Miller's parser in handling all combinations
simultaneously rather than sequentially, and in the simplicity of
the matching operations it uses.

Current artificial intelligence programming systems such
as PLANNER [8], QA4 [13], and SAIL [3] can match a given
relational template against a data base. However, the method
they use is an exhaustive, iterative, and associative search. If
several templates are to be matched against the data base,
they must be matched one at a time. In contrast, the
associative matching operation performed by ACORNs
effectively tests all the relations of all the templates
simultaneously.

The ACORN's nearest relative among general
pattern-matching methods is hierarchical synthesis [1]. Consider the
task of matching a template, such as a schematic
representation of a building, against an input set of line
segments. A recognition algorithm employing hierarchical
synthesis replaces the single, many-component template for
"building" with a hierarchy of templates for "doors," "windows,"
"stories," etc. A higher-level template can be matched only if
its lower-level constituents are. Hierarchical synthesis
considerably reduces recognition time for two reasons. First,
it can exploit the repetition of subtemplates by recognizing all
instances of a single subpattern just once. Second, before
considering whether or not the entire pattern specified by a
template is present, it can ensure that all necessary
subpatterns are present.

However, hierarchical synthesis as described in [1]
depends on a hierarchy defined a priori by the user. This
limitation is transcended by Hayes-Roth's interference
matching method [5], which does hierarchical synthesis in
parallel in all possible directions, thereby obviating the need
for a predefined hierarchy. In interference matching, a
template is represented as a set of relations. Each relation is
a predicate with one or more symbolic variables. The input is
also a set of relations, whose arguments are constants. A
partial match consists of an assignment of input constants to
the symbolic variables of a subset of template relations which
makes them all true. Interference matching works by finding
partial matches and combining them into complete matches.

Like interference matching, the ACORN method is an
improved version of hierarchical synthesis in that it requires
no predefined hierarchy. The ACORN compiler itself
determines an economical hierarchy, and embeds it in the form
of a recognition network. Hierarchy selection can be factored
out into a separate compilation phase because the choice of
hierarchy depends only on the templates and not on the
individual input utterance. In interference matching, on the
other hand, hierarchy selection depends on the input pattern,
and is therefore a part of the recognition process. Thus the
ACORN method combines the convenience of automatic
hierarchy selection with the efficiency which comes from using
a predefined hierarchy in the recognition process.

In real-world applications, input is matched against
several top-level templates. Current methods of hierarchical
synthesis and interference matching involve matching the input
against one template at a time. Such an approach is clearly
undesirable for tasks such as speech recognition, which may
involve large numbers of templates. The ACORN compiler
takes a whole set of templates and produces a single, unified
recognition network for it; common subtemplates are shared
not just within top-level templates but also between them. An
instance of a subtemplate is recognized just once, not
separately for each top-level template in which it occurs.
Hence recognition time depends not on the total number of
templates, but just on the number of templates which match
some portion of the input. This property is encouraging, since
the number of templates required to recognize a significant
subset of English would probably be several thousand.

In sum, an ACORN can be looked at as a bottom-up
version of an ATN; a parallel and non-deterministic version of
a PARRY-like system; a general pattern-matcher; or an
improved mechanism for hierarchical synthesis, with automatic
hierarchy selection and subtemplate sharing between
templates.

APPLICATIONS, IMPLICATIONS, AND EXTENSIONS

In order for an ACORN to be efficient, the templates and
input data characteristic of the chosen problem domain should
tend to be asymmetric, so that a template will usually match a
given portion of the input in at most one way. Let us illustrate
with a negative example. Suppose the template T we wish to
match is K_5(a, b, c, d, e), the complete graph on five vertices,
represented by the conjunction of relations
line(a, b) ∧ line(a, c) ∧ ... ∧ line(d, e). Then any occurrence of
K_5 (as a subgraph, say) in the input corresponds to 5! = 120
instances of T, since there are 5! different ways to bind the
variables a, b, c, d, e to the five vertices of the K_5 in the input.
For symmetries on a larger scale, the problem grows
combinatorially worse. Clearly, an ACORN would be inefficient
in such a domain, since it would insist on finding all instances
of every template.
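
The count of 120 can be confirmed by brute force. The sketch below assumes that the line relation is symmetric (a line has no direction) and that distinct variables are bound to distinct vertices; the variable names are chosen for this illustration.

```python
from itertools import combinations, permutations

variables = "abcde"
# The template: line(a, b) ^ line(a, c) ^ ... ^ line(d, e), ten relations.
template = list(combinations(variables, 2))

vertices = range(5)
# Input: a complete graph on five vertices, lines stored without direction.
lines = {frozenset(pair) for pair in combinations(vertices, 2)}

bindings = 0
for assignment in permutations(vertices):
    env = dict(zip(variables, assignment))
    if all(frozenset((env[x], env[y])) in lines for x, y in template):
        bindings += 1
```

Every one of the 5! assignments satisfies all ten relations, so `bindings` ends at 120: a single occurrence in the input yields 120 distinct instances of the template.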

Fortunately, many problem domains do not exhibit this
bothersome property. Speech, in particular, is highly
asymmetric, partly because it is embedded in a one-dimensional
ordered temporal domain. If tell(t_1, t_2) is true,
then t_1 < t_2, so tell(t_2, t_1) cannot be true. Symmetries at a
higher level can occur only if there is more than one
syntactically and semantically valid way to group the input
words into phrases, i.e., if the input is inherently ambiguous.

What are the advantages of ACORNs for speech
understanding? The bottom-up, template-oriented approach is
especially conducive to handling idiomatic,
conversational natural language robustly. Consider the
problem in spoken language of spurious insertions such as "oh,"
"um," "er." We wish to treat them the same as silences. We do
this by adding rules like [oh(t_1, t_2) -> SILENCE(t_1, t_2);] to our
template grammar, and relaxing the test t_2 = t_3 for temporal
adjacency between two relation instances, such as tell(t_1, t_2)
and me(t_3, t_4), to compute t_2 = t_3 ∨ SILENCE(t_2, t_3).
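
A minimal sketch of the relaxed adjacency test follows. The function name `adjacent`, the representation of SILENCE instances as a set of (start, end) pairs, and the time values are all assumptions made for this illustration.

```python
def adjacent(t2, t3, silences):
    """Relaxed adjacency test: t2 = t3, or SILENCE(t2, t3) holds.

    silences: set of (start, end) intervals recognized as SILENCE,
    including those derived from insertions such as "oh" or "um".
    """
    return t2 == t3 or (t2, t3) in silences


# Hypothetical times: "tell" ends at 18, "oh" fills [18, 31], "me" starts at 31.
silences = {(18, 31)}
strict_ok = adjacent(18, 18, set())       # ordinary adjacency still passes
bridged = adjacent(18, 31, silences)      # the gap is covered by a SILENCE
unbridged = adjacent(18, 31, set())       # no SILENCE instance, so it fails
```

Note that the SILENCE instance is only consulted, never consumed, so the same interval remains available to any other template that can match it.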

This example also illustrates the reason for
non-deterministic application of Colby's methods in a speech
understanding system. Even if a spurious insertion is
recognized, the corresponding portion of the input must not be
discarded, since it may have been recognized incorrectly. If an
ACORN recognizes an instance of "oh" in the interval [t_1, t_2], it
puts the instance (t_1, t_2; ...) on the instance list of SILENCE,
without discarding any information. That way, if the interval
actually contains the word "no," it is still there for the lexical
analyzer to find. In contrast, when PARRY ignores information,
it throws it away altogether.

Another phenomenon common to conversational speech
is the idiomatic expression, such as "How are you?" Using an
ACORN, we can simply include explicit template rules for such
expressions, e.g.,

[how+are+you(t_1, t_2) ->
    GREETING(t_1, t_2); REPLY("Fine, how are you?")],

thereby short-circuiting the detailed syntactic parse
which would be attempted by a more formal system such as
Woods'.

The two techniques just described can be combined.
Certain idioms such as "by the way" carry essentially no useful
information and can be treated as spurious insertions by rules
like

[by(t_1, t_2) ∧ the(t_2, t_3) ∧ way(t_3, t_4) ->
    SILENCE(t_1, t_4);].

Some expressions occur either as meaningless idioms or
as meaningful phrases, depending on context. Consider, for
example, the utterance "I see, could I see the midnight digest?,"
which occurred in an actual experimental protocol. The first
occurrence of "I see" is idiomatic and can be ignored; the
second is essential to the meaning of the utterance. An
ACORN, in processing this utterance, would recognize both
occurrences as instances of SILENCE, without discarding any
information. The first occurrence would be ignored, as
desired, but the second one would still be available to match
other templates.

Spurious deletions can also be handled by ACORNs. To
handle spurious deletions, we want to permit partial matching
of templates. We can do this within the ACORN framework
simply by adding extra templates corresponding to commonly
occurring partial matches of the original templates. The
obvious weakness of this method is that it requires a priori
knowledge of which deletions are likely to occur. The success
of the method might require many iterations over a large
corpus of test utterances, with new templates added as
needed. Hopefully this process would converge, after a
reasonable number of such iterations, to acceptable
performance with respect to handling spurious deletions. (This
method of "massive iteration" seems to have worked
successfully for PARRY.)

Partial templates can be used for another purpose as
well. Although the bottom-up approach has several
advantages, as described above, it is useful to have certain
properties associated with top-down processing. One such
property is the ability to focus the attention of lower-level
modules on critical portions of input. Another is the ability to
hypothesize words from above, for lower-level modules to
confirm or reject. Although we earlier referred to a lexical
analyzer which finds all instances of primitive relations (words)
in the input utterance, this would in practice be too expensive.
The actual Hearsay II system seeks to constrain
hypothesization as much as possible; to do this it applies
high-level information to cut down the number of plausible words
matched against each portion of the input. Thus it is desirable
to have a speech understanding ACORN generate intermediate
partial information telling the lower-level modules which
portions of the input they should concentrate on processing,
and which words are likely to occur at a given place in the
input, on the basis of the already recognized portions of the
surrounding context.

This top-down extension to the basic bottom-up
mechanism requires knowledge about the predictive value of
partial templates. For example, we know that "What time"
often occurs in the phrase "What time is it?" We can
incorporate this information in an ACORN by including a rule

[what(t1, t2) ∧ time(t2, t3) ->
WHAT.TIME(t1, t3); TEST(t3, t_any; "is it")],

where TEST is the action invoked upon recognition of
the template. The effect of the TEST is to look for the missing
instance of "is it" starting at the time t3 in the input utterance.
If it is found, it is added to the instance list for is*it, leading to
the desired completion of the full template "What time is it?"

In the above example, a partial template was used to
predict downwards in the network. Partial templates can also
be good upward predictors. For example, given an instance of
the partial template T1 = "time is it," the probability P(T2|T1)
that it occurs as part of the template T2 = "What time is it?"
may approach certainty. If P(T2|T1) is high enough, say .99,
we may wish to save processing time by simply predicating
that T2 does in fact occur. Such a scheme is currently being
implemented.

EVALUATION AND CONCLUSIONS.

A full evaluation of the ACORN method must of course
await experience with large-scale implementations. In the
meantime, there are several properties we observe from the
current, partial implementation.

(1) The recognizer is efficient.

(2) It is extremely easy to modify, since changes are
restricted to the template grammar.

(3) Using an ACORN makes it possible to dispense with a
formal parse.

(4) Even when an ACORN cannot fully parse an
utterance, it can still provide a partial parse.

(5) ACORNs are organized so as to factor recognition
processing into simple, universal, and independent operations
performed at the nodes. This has made them trivial to
implement and, in addition, makes them well-suited to parallel
execution on a multiprocessor.

Finally, we expect ACORNs to have a broad range of
applications, since they seem well-suited to recognizing any
sort of relational pattern which manifests few symmetries.
Both spoken utterances and real-world scenes appear to be in
this class. At this point, we have implemented one ACORN
processor for the syntax and semantics in speech (SASS)
module of Hearsay II. Another ACORN processor has been
built for recognizing the occurrence of inferred patterns
(abstractions) in pattern learning training data. The
abstractions themselves are produced by a program called
Sprouter which grows a minimal ACORN to recognize all
subtemplates common to two or more relational patterns. [5]
From these experiences, it seems that ACORNs may provide an
effective mechanism for general recognition.
Figure 1.

Sample Recognition Network (an ACORN). See text for an explanation of the tests
i:j below the nodes and the generators (i1, ..., ik) above them.
ABSTRACT

The Hearsay II speech understanding system being
developed at Carnegie Mellon University has an independent
knowledge source module for each type of speech knowledge.
Modules communicate by reading, writing, and modifying
hypotheses about various constituents of the spoken utterance in
a global data structure. The syntax and semantics module uses
rules (productions) of four types: (1) recognition rules for
generating a phrase hypothesis when its needed constituents
have already been hypothesized; (2) prediction rules for inferring
the likely presence of a word or phrase from previously
recognized portions of the utterance; (3) respelling rules for
hypothesizing the constituents of a predicted phrase; and (4)
postdiction rules for supporting an existing hypothesis on the
basis of additional confirming evidence. The rules are
automatically generated from a declarative (i.e., non-procedural)
description of the grammar and semantics, and are embedded in a
parallel recognition network for efficient retrieval of applicable
rules. The current grammar uses a 450-word vocabulary and
accepts simple English queries for an information retrieval
system.

INTRODUCTION:  THE  PROBLEM 

The fundamental problem facing the syntax and semantics
component of a speech understanding system is uncertainty. The
system is uncertain about a variety of questions, including:
whether a given word is really uttered by the speaker; when a
recognized word begins and ends; whether a particular interval
of the utterance contains a silence, a filled pause ("er," "um,"
"uh"), an informationless interjection ("y'know," "I mean"), or an
information-bearing word or phrase; whether a recognized word
or phrase is used in a particular sense; etc. Any decisions made
on the basis of such uncertain information are potentially
incorrect and must therefore be reversible. The classical method
of reversing decisions is backtracking. Backtracking and
best-first evaluation of alternative parses are the primary strategies
employed by the Hearsay I speech understanding system (Reddy,
et al., 1973a, 1973b).

In Hearsay II (Lesser, et al., 1975) multiple alternatives are
represented explicitly in a global data structure ("blackboard")
and considered in parallel rather than one at a time as in Hearsay
I. Processing is driven by independent data-directed knowledge
source modules (KSs) which create, examine, and revise
hypotheses, stored on the blackboard, about the utterance. One
dimension of the blackboard is level of representation: an interval
of speech may be simultaneously represented at the acoustic,
phonetic, phonemic, syllabic, word, phrasal, and conceptual levels.
The KSs translate from one level to another with the ultimate
objective of representing the utterance at the conceptual level,
i.e., understanding it. Hearsay II is a distributed logic system in
that control of processing is distributed heterarchically among
the KSs rather than organized hierarchically. Each KS is
responsible for deciding when it has useful information to
contribute to the analysis of the input.

The syntax and semantics KS in Hearsay II is called SASS,
and deals with hypotheses representing words and phrases
perceived or expected in the utterance. From SASS's viewpoint,
the blackboard can be viewed as a chart of hypothesized words
as in Figure 1, which represents the word hypotheses generated

1 This research was supported in part by the Defense Advanced
Research Projects Agency under contract no. F44620-73-C-0074
and monitored by the Air Force Office of Scientific
by lower level KSs in response to the utterance "Tell me about
beef." In the figure, time goes from left to right and the vertical
dimension represents hypothesis credibility on a scale from -100
to 100, as estimated by other KSs. SASS's problem is to find the
most plausible sequence of temporally adjacent words.
Plausibility is defined by the credibility of the individual word
hypotheses and the grammaticality and meaningfulness of the
sequence. The concept of temporal adjacency is generalized to
tolerate fuzzy word boundaries, overlap between successive
words, silences in the middle of word sequences, and
unintelligible intervals. Since some of the uttered words may not
have been hypothesized, SASS must be able to expand the
solution space by inferring the likely presence of a missing word
on the basis of existing word hypotheses. Such inferences are
relatively weak since several predictions may be plausible in a
given context. In the example of Figure 1, SASS hypothesizes
the missing word "tell" in the interval preceding "me about beef."
Since SASS is uncertain as to which word hypotheses are
correct, it also makes several incorrect word predictions. Figure
2 shows the words predicted by SASS on the basis of the words
shown in Figure 1. The figures do not reflect the fact that the
various hypotheses are generated at different times and SASS
starts generating predictions prior to completion of the word
recognition process.

In order to control the potentially explosive search
through this combinatorial and expanding solution space, SASS
must be able to reflect the variable reliability of its inference
rules and to relax its plausibility criteria dynamically so as to
stimulate processing on unrecognized portions of the utterance.
SASS must be able to use partial information to guide further
processing in useful directions. To avoid duplicated computation,
SASS must store and use partial parses, which are intermediate
computations (plausible subsequences) common to many potential
parses. SASS must combine these partial parses into plausible
complete parses, select the best complete parse, interpret the
meaning of the recognized utterance, and respond appropriately.

The problems faced by SASS -- uncertainty, combinatorial
search, fuzzy pattern-matching, strong and weak inferences, and
the need to exploit partial information -- are common to many
large knowledge-based systems. Efficient solution of these
problems appears to require a system organization in which the
scheduling of inferential processes is sensitive to various
cooperative and competitive relationships among the inferred
hypotheses. For example, processing should be facilitated on an
hypothesis supported cooperatively by multiple sources of
information. Conversely, processing should be inhibited on an
hypothesis which competes with -- i.e., is inconsistent with -- a
strongly credible hypothesis. Inhibition in an environment of
uncertainty must be implemented non-deterministically, since the
weaker hypothesis may in fact be correct. Non-deterministic
inhibition is effected in Hearsay II by a focus of attention
mechanism which allocates computational resources so as to
consider the most promising hypotheses before others
(Hayes-Roth & Lesser, 1976).

The approach used in SASS is relevant to pattern
recognition for its fuzzy pattern matching; to problem solving for
its flexible combination of bottom-up, top-down, forward
inferencing, and problem reduction mechanisms; and to
information retrieval and the problem of pattern-directed
function invocation for its efficient mechanism for continuously
monitoring a data base for occurrences of any of a large number
of relational patterns or templates.
OVERVIEW OF METHOD

Given a declarative (i.e., non-procedural) description of the
target language which our system is to understand, we need to
convert it into behavior which is adequate to understand
utterances in the language efficiently and robustly. Our approach
has been to automate this conversion as much as possible.
Syntactic and semantic knowledge about the target language is
expressed in a compact, readable grammar. A compiler converts
the grammar into precondition-response productions. The
productions are embedded in a recognition network to enable
efficient continuous monitoring of the blackboard for stimuli
matching production preconditions. In general, many productions
will be invocable at any given time. Various scheduling policies
serve to hasten the invocation of productions which are
considered likely to generate useful (correct, relevant, and
necessary) results and to inhibit or defer less promising
invocations.

LINGUISTIC KNOWLEDGE

The grammar describing the target language is expressed
using parameterized structural representations (PSRs), which are
sets of attribute-object pairs. We use a PSR to define a class of
words and phrases which can fulfill the same syntactic or
semantic function in the target language. The current target
language consists of simple English queries for a news retrieval
program. For example, the PSR

($CLASS: $QUERY, $PNAME: "PARSED QUERY",
<: $GIMME+$WHAT,
<: TELL+$ME+$RE+$TOPICS,
<: WHAT+HAPPENED+$ANYWAY,
<: WHAT+$BE+THE+$NEWS+$RE+$TOPICS,
<: $BE+THERE+$ANY+$PIECES+$RE+$TOPICS,
$ACTION: PASS,
$LEVEL: 300)

defines the class "$QUERY" of possible queries in terms of its
alternative syntactic realizations. The attribute "<" denotes
membership in the class. Each member of the class is a sequence
template whose constituents, separated by "+", are words or
phrases. Phrasal constituents are prefixed by "$" and defined in
turn by other PSRs. Additional attributes of the class are defined
by other components of the PSR. "$ACTION: PASS" means that
SASS's response upon recognizing an instance of any of the five
templates in the class should be to treat it as an instance of
$QUERY. The $LEVEL attribute estimates the relative
completeness of the partial parse underlying the hypothesized
phrase. The PSR

($CLASS: $TOPICS,
<: $PLACE,
<: $FOOD,
<: $TECHNOLOGY,
<: $SCIENCE,
<: $GOVERNMENT,
<: $POLITICS,
<: $PEOPLE,
<: $TOPICS+$CONJUNCTION+$TOPICS,
$ACTION: PASS, $LEVEL: 40)

defines the class of possible topics in the news in terms of its
semantic subclasses. The grammar for the current 450-word
target language consists of 113 PSRs.
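One way to picture a PSR is as a small record holding the class name, its member templates, and its remaining attributes. The Python sketch below is an illustration only: the `PSR` dataclass, its field names, and the `PHRASE_PREFIX` constant are assumptions introduced here, and only two of the five query templates are listed to keep the example short.

```python
from dataclasses import dataclass, field

# Character assumed to mark phrasal constituents (those defined by other PSRs).
PHRASE_PREFIX = "$"

@dataclass
class PSR:
    """A class of words or phrases sharing one syntactic or semantic function."""
    name: str                                    # class name
    members: list = field(default_factory=list)  # alternative sequence templates
    action: str = "PASS"                         # response on recognition
    level: int = 0                               # relative completeness estimate

    def member_constituents(self):
        """Split each member template on "+" into its constituents."""
        return [m.split("+") for m in self.members]

def is_phrasal(constituent):
    """Phrasal constituents carry the prefix; plain words do not."""
    return constituent.startswith(PHRASE_PREFIX)

query = PSR(
    name="$QUERY",
    members=["$GIMME+$WHAT", "TELL+$ME+$RE+$TOPICS"],
    action="PASS",
    level=300,
)
```

Calling `query.member_constituents()` yields one list of constituents per template, and `is_phrasal` separates words such as TELL from phrases that need their own PSR.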

TYPES  OF  BEHAVIOR  RULES 

SASS  has  a repertoire  of  strong  and  weak  methods, 
represented  by  different  types  of  behavior  rules  used  in 
understanding. 

A recognition rule generates a phrase hypothesis in
response to sufficiently credible hypotheses for the phrase's
constituents. SASS considers an hypothesized constituent to be
recognizable if its credibility rating, determined by other KSs,
exceeds a minimum threshold for plausibility. The hypothesized
constituents may also have to satisfy some structural condition
such as temporal adjacency between sequential constituents of a
phrase. A recognition rule represents a strong inference; its
strength is the probability that the recognized constituents can
be interpreted as an instance of the phrase. For example, "beef"
can be interpreted as a food or as a complaint, depending on
context. Recognition rules drive processing upward toward a
complete parse of the utterance from plausible partial parses.
Recognition behavior can be thought of as bottom-up parsing.

A prediction rule hypothesizes a word or phrase which is
likely to occur in the context of a previously recognized portion
of the utterance. Prediction rules drive processing outward in
time from "islands of plausibility," and are necessary since not all
words in a spoken utterance may be recognized bottom-up by
lower level KSs. Predictive behavior can be thought of as
forward inferencing. The strength of a predictive inference is
the conditional probability that the predicted constituent occurs,
given that its predictive context has been recognized. This
strength is inversely related to the number of constituents which
can plausibly occur in the given context.

A respelling rule enumeratively hypothesizes the
constituents of a predicted phrase, by subdividing an
hypothesized sequence into hypotheses for its sequential
constituents or by splitting an hypothesized class into alternate
hypotheses for its various members. Respelling rules drive
processing downward toward the word level, so that high-level
phrasal predictions can ultimately be tested word-by-word by
lower level KSs. Respelling can be thought of as top-down
behavior or generation of subgoals from goals.

Finally, a postdiction rule solicits post hoc support for (i.e.,
serves to increase the credibility ratings of) existing hypotheses
from other hypotheses in whose context they are plausible.
Postdiction rules include prediction and respelling rules which are
too weak to justify creation of hypotheses, but can contribute
useful information when the hypotheses already exist. For
example, an expectation for an instance of $TOPICS following the
word "about" should not be respelled into hypotheses for all the
nouns in the vocabulary, since to do so would explode the search
space. However, once the word "beef" is hypothesized in the
correct time interval on the basis of other knowledge, the
hypothesis should receive support from the expectation for a
topic word.

Postdiction rules serve three functions: they allow
cooperation between inferences which support the same
hypothesis on the basis of different evidence; they allow words
and phrases hypothesized with initial low credibility ratings to be
recognized on the basis of their contextual plausibility; and they
help focus attention in productive directions by increasing the
ratings of hypotheses which are contextually plausible (and thus
relatively likely to be correct) so that processing on them is
scheduled sooner. In the sense that postdiction responds to
weakly-rated hypotheses by seeking causal antecedents
(predictors) for them, postdiction can be thought of as post hoc
inferencing or "twenty-twenty hindsight."

CONVERSION OF STATIC KNOWLEDGE TO BEHAVIOR RULES

Most of the information necessary for understanding the
target language is implicit in the grammar which describes it. The
automatic conversion of this static information into a usable
procedural form is performed by a simple compiler called CVSNET,
which translates PSRs into recognition, prediction, respelling,
and postdiction rules. A few rules hand-coded in explicitly
procedural form are added, for example a rule that prints a
message when a sentence is recognized. The only linguistic
knowledge in CVSNET itself is an elementary understanding of
sequences and classes. CVSNET decomposes the sequence
templates c1+c2+...+cn into pairs of subsequence templates. For
example, from the sequence template TELL+$ME+$RE+$TOPICS,
CVSNET generates the new templates $ME+$RE+$TOPICS and
$RE+$TOPICS.
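The decomposition step can be sketched as below. The source does not spell out the exact pairing CVSNET uses, so this Python illustration assumes a simple right-branching split (first constituent, remainder); the function name `decompose` and the "$" prefix on phrasal constituents are likewise assumptions of the sketch.

```python
def decompose(template):
    """Split c1+c2+...+cn into (head, rest) pairs, right-branching.

    Returns the list of pairs and the new intermediate templates created
    along the way. The right-branching order is an assumption.
    """
    parts = template.split("+")
    pairs, new_templates = [], []
    while len(parts) > 1:
        head, rest = parts[0], parts[1:]
        rest_name = "+".join(rest)
        pairs.append((head, rest_name))
        # A remainder of two or more constituents is itself a new template.
        if len(rest) > 1:
            new_templates.append(rest_name)
        parts = rest
    return pairs, new_templates

pairs, new_templates = decompose("TELL+$ME+$RE+$TOPICS")
```

Under this assumption the new templates are exactly the two named in the text, and each pair is a candidate for one recognition rule and one respelling rule.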

CVSNET then generates the appropriate rules for each
template. The recognition rule for a sequence is to concatenate
its hypothesized subsequences provided they are temporally
adjacent and sufficiently credible. The respelling rule respells a
predicted sequence into its two subsequences. Prediction rules
are generated to predict the remaining constituents of the
sequence when a subsequence of it has been recognized.
Similarly, CVSNET generates rules for recognizing an instance of
a class from an hypothesized constituent of the class and for
respelling a predicted class into its constituents. CVSNET
estimates the strength of each such rule as an inverse function of
class size. CVSNET also generates the relevant postdiction rules.
Some of the rules generated from the PSRs are shown below;
rule type is indicated by the type of arrow separating stimulus
and response ("-->" for recognition, "->" for prediction, "+>" for
respelling, and "<-" for postdiction) and rule strength is shown in
parentheses.

TELL & $ME --> TELL+$ME < CONCATENATE (100) (100) >
TELL & $ME <- TELL+$ME < POSTDICT!SEQ (100) (100) >
TELL+$ME +> TELL & $ME < RESPELL!SEQ (100) (100) >

$ME -> TELL < PREDICT!LEFT (50) >
TELL <- $ME < POSTDICT!LEFT (50) >

TELL -> $ME+$RE+$TOPICS < PREDICT!RIGHT (100) >
$ME+$RE+$TOPICS <- TELL < POSTDICT!RIGHT (100) >

$FOOD --> $TOPICS < PASS (100) >
$TOPICS +> $FOOD < RESPELL!CLASS (70) >
$FOOD <- $TOPICS < POSTDICT!ELEMENT (88) >

The linguistic knowledge expressed compactly in the
grammar is represented highly redundantly in the generated
rules. This redundancy provides the basis for robust
performance in the errorful domain of speech: in regions of the
utterance where strong inferences (recognition rules) are
inadequate (for example, because lower-level KSs have failed to
hypothesize some of the uttered words), weaker inferences must
be applied in order for the utterance to be understood.

IDENTIFICATION OF INVOCABLE RULES

All of the rules described have the form
[precondition(x1, x2, ..., xn) ->f response(x1, x2, ..., xn)], signifying that
a specified response can be inferred with strength f from the
objects x1, x2, ..., xn whenever these objects are in the
relationships described by the associated precondition. The large
number of rules required even in a relatively simple system (over
3000 rules for a 450-word vocabulary) necessitates an efficient
means of continuously monitoring the blackboard to determine
which rules are currently invocable because of data satisfying
their preconditions.

This problem is solved by embedding the rules in an
automatically compilable recognition network (ACORN), as
discussed elsewhere (Hayes-Roth & Mostow, 1975). In brief,
each grammatical constituent (word or phrase) is assigned a
unique node in the network. Rules whose preconditions refer to
the constituent are stored at the node. Whenever an hypothesis
for the constituent is created or revised, its node is activated and
the relevant rules become invocable.
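The indexing idea, one node per constituent with the relevant rules stored at that node, can be sketched as follows. This Python fragment is a simplified illustration rather than the ACORN implementation: the `Network` class, its method names, the rule labels, and the hypothesis dictionary are all hypothetical, and precondition testing is omitted.

```python
from collections import defaultdict

class Network:
    """One node per constituent; each node stores the rules that mention it."""

    def __init__(self):
        self.rules_at = defaultdict(list)  # constituent -> rule names
        self.invocable = []                # (rule, hypothesis) pairs awaiting scheduling

    def add_rule(self, rule_name, precondition_constituents):
        """Store the rule at the node of every constituent it refers to."""
        for constituent in precondition_constituents:
            self.rules_at[constituent].append(rule_name)

    def post(self, constituent, hypothesis):
        """Called when an hypothesis is created or revised: activate its node."""
        for rule in self.rules_at[constituent]:
            self.invocable.append((rule, hypothesis))

net = Network()
net.add_rule("recognize TELL+$ME", ["TELL", "$ME"])
net.add_rule("predict $ME+$RE+$TOPICS to the right", ["TELL"])
net.post("TELL", {"word": "TELL", "start": 0, "end": 10, "rating": 80})
```

Posting an hypothesis touches only the rules stored at one node, so the cost of finding invocable rules does not grow with the total number of rules in the grammar.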

PRINCIPLES  OF  CONTROL 

The rule preconditions are defined in terms of various
thresholds for plausibility, temporal adjacency, etc. These
thresholds can be given values specific to a particular region of
the utterance and are dynamically modifiable. Thus rules are
invoked not only in response to new hypotheses but also in
response to local threshold changes. This mechanism allows
flexible matching of rule preconditions. Thresholds can be
relaxed in unrecognized regions of the utterance to permit
focalized application of methods whose weakness would cause
combinatorial explosion if they were applied uniformly throughout
the utterance.

Hypotheses are explicitly linked in the data base to
hypotheses which support them inferentially and the links are
marked with the strengths of the inferences. A rating policy
module (RPOL) rates the plausibility of new hypotheses on the
basis of the ratings of the hypotheses which support them and
the strengths with which they do so. RPOL updates these ratings
when an hypothesis receives new support or when the rating of
one of its supporting hypotheses is changed. Hypotheses are
rated separately on their contextual plausibility and on the
extent to which they are supported by lower level hypotheses.

The combinatorial search can be controlled by modifying
the appropriate threshold values. For example, the search can
be broadened or narrowed by relaxing or tightening criteria for
recognizability, since the solution space consists only of
sequences of recognizable words. A best-first search policy can
be implemented simply by ordering rule invocations according to
the strengths of the rules and the plausibility ratings of the
hypotheses matching the rules' preconditions. The search can be
further focussed by inhibiting low level processing within a
region already accounted for by a credible high-level hypothesis.
Of course this policy must be pursued with caution since the
high-level hypothesis may be incorrect. Cautious inhibition is
implemented as deferred processing. A similar policy of
procrastination can be used to defer application of weak
inferences in a region until strong methods fail. An inferential
process can be deferred by scheduling it with low priority (so
that it may never in fact be executed), or by scheduling it only
when the relevant thresholds are relaxed. The latter mechanism
permits reconsideration of previously rejected alternatives.
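A best-first ordering of rule invocations can be sketched with a priority queue. The text says only that invocations are ordered by rule strength and hypothesis rating; multiplying the two, as done below, is an assumption of this Python sketch, as are the `Scheduler` class and the rule labels.

```python
import heapq
import itertools

class Scheduler:
    """Best-first agenda: the highest strength * rating is invoked first."""

    def __init__(self):
        self._heap = []
        self._tie = itertools.count()  # keeps insertion order for equal priorities

    def schedule(self, rule, strength, rating):
        priority = strength * rating   # assumed combination, not from the source
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, next(self._tie), rule))

    def next_rule(self):
        """Pop the most promising invocation, or None if the agenda is empty."""
        return heapq.heappop(self._heap)[2] if self._heap else None

agenda = Scheduler()
agenda.schedule("predict TELL to the left of $ME", strength=50, rating=60)
agenda.schedule("recognize TELL+$ME", strength=100, rating=80)
agenda.schedule("respell $TOPICS into $FOOD", strength=70, rating=40)
order = [agenda.next_rule() for _ in range(3)]
```

Low-priority entries simply stay on the agenda, which matches the idea of deferral: a weak inference is never rejected outright, but it may never be reached.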

Discourse rules can also help to focus the search. For
example, an hypothesis that the current topic of conversation is
food increases the a priori probability that the word "beef" will
be uttered. If we can predict subject matter or syntax from any
one of many knowledge elements (e.g., a recognized cue word in
the same utterance, semantic analysis of previous utterances,
knowledge of the particular speaker's interests), we can create
such an hypothesis. This form of semantic and syntactic priming
is non-restrictive in that it does not preclude recognizing an
utterance which is inconsistent with an hypothesized topic of
conversation or an expectation for a particular grammatical
construction. The mechanism is also graceful in that it does not
impose a strict hierarchy of topical domains, and in fact tolerates
ambiguity and uncertainty in the expectations generated by
previous discourse.

Inexact matching can also be carefully controlled with
thresholds. An interval of silence in the middle of an utterance
can be accepted by relaxing temporal adjacency thresholds in the
region of the silence so that hypothesized sequence constituents
temporally separated by the silence will be considered
temporally adjacent. For example, if the speaker says "Tell me
about ... beef," this mechanism allows the words "about" and
"beef" to be considered temporally adjacent. Interjections and
unclear intervals of speech can be nondeterministically ignored
by treating them as silences. Sometimes the uttered words
cannot be recognized by lower-level KSs even after SASS
hypothesizes them on the basis of surrounding context. In such
cases, partially-matched phrases can be recognized by lowering
credibility thresholds in unintelligible intervals so that unfulfilled
expectations for missing constituents are treated as if they had
been fulfilled. These mechanisms can even be used to tolerate
some variation from the target language by ignoring extra
verbiage not accounted for in the grammar and by filling in
omitted constituents required by the grammar.
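Region-specific relaxation of the temporal adjacency threshold might look like the sketch below. Everything here is an assumption made for illustration: the representation of a relaxed region as a (start, end, allowed gap) triple, the function names, and the time values. Overlap between successive words, which the system also tolerates, is left out for brevity.

```python
def max_gap(regions, t, default=0):
    """Return the adjacency tolerance in force at time t."""
    for start, end, gap in regions:
        if start <= t <= end:
            return gap
    return default

def temporally_adjacent(end_of_first, start_of_second, regions):
    """Adjacent if the gap between the two words is within the local tolerance."""
    gap = start_of_second - end_of_first
    return 0 <= gap <= max_gap(regions, end_of_first)

# "Tell me about ... beef": "about" ends at 30, "beef" starts at 45 (assumed times).
strict = []                 # no relaxed regions: exact adjacency required
relaxed = [(30, 45, 15)]    # tolerance raised only around the silence
```

With `strict`, "about" and "beef" fail the adjacency test; with `relaxed`, they pass, while adjacency elsewhere in the utterance is still judged by the default threshold.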

PERFORMANCE EVALUATION

The contribution of each KS in Hearsay II is highly
dependent upon the behavior of the others. Consequently,
SASS's performance is difficult to evaluate. For instance, SASS's
prediction of the missing word "tell" in the previous example may
have been critical to recognition of the utterance. On the other
hand, the word hypothesizer KS might eventually have lowered
its own thresholds enough to have weakly hypothesized the
missing "tell." In this case, SASS's postdiction of the hypothesized
"tell" from its surrounding context might have been critical in
increasing its credibility rating sufficiently to permit it to be
recognized.

Despite the complex dynamics of the integrated system, we
do have an evaluation methodology for SASS which will be
pursued in the next year. Basically, our strategy is to generate a
variety of artificial problems, each defined by a set of
hypothesized words, and measure the elapsed time until SASS
parses the utterance. In particular, we should be able to
evaluate the relative efficacy of the four types of behavior rules
in overcoming various kinds of error in the artificial input. If we
can then estimate the relative frequencies of different kinds of
errors generated by lower level KSs, we can attempt to optimize
SASS's behavioral profile.

CONCLUSION 

There are many functions to be performed by a syntax and
semantics knowledge source within a speech understanding
system. In addition to simply parsing a sentence, the knowledge
source must use a variety of strong and weak inferencing
methods to hypothesize missing constituents and adduce support
for existing hypotheses found in appropriate contexts. A
production system using four types of rules has been developed
to implement such desirable "knowledgeable" behaviors, which
are automatically inferred from a simple declarative
representation of the language to be understood. By making the
invocation of a rule be dependent upon both the credibility of
the data matching the rule's preconditions and the estimated
strength of the rule as a useful inference, the entire search
process may be controlled so as to pursue dynamically modifiable
global and local processing objectives. In sum, such a production
system provides a general framework for representing
"knowledgeable" syntactic and semantic behaviors. Moreover, the
fine computational grain of the behavior rules makes possible the
flexible and precise control needed to avoid a combinatorial
explosion in the search for a plausible interpretation of
continuous speech.
Figure 1. Words hypothesized bottom-up in response to utterance "Tell me about beef"
marks correct hypothesis; "[" and "]" denote hypothesized beginning and end of utterance
Figure 2. Words predicted by SASS on the basis of the hypotheses shown in Figure 1
IN THE HEARSAY II SPEECH UNDERSTANDING SYSTEM
Department of Computer Science
Carnegie-Mellon University
Pittsburgh, Pennsylvania
ABSTRACT

While much of the syntactic and semantic analysis of a spoken utterance in HEARSAY II is
performed by the syntax and semantics knowledge source module (SASS), the final semantic
interpretation is produced by a discourse analysis module (DISCO). Using knowledge about the
conversation, DISCO interprets the speaker's intention and directs the appropriate
activities within the task program. In most cases, the intention of the speaker is to establish a
general topic of interest, to retrieve articles by keyword expressions, or to have retrieved
articles written on an output device. In some circumstances, DISCO can predict the likely
subsequent occurrence of a particular type of phrase. Such predictions cause SASS, at the
outset of analyzing a new utterance, to create corresponding phrasal goals which effect a
semantic bias (subselection, perceptual set) that influences which words are hypothesized and
how plausible they appear.

As described elsewhere [Hayes-Roth and Mostow, 1976], a spoken utterance is
recognized if the syntax and semantics knowledge source module (SASS) can generate a
grammatical parse. In the current task (document retrieval by keyword expression), there are
seven distinct types of intentions reflected by SASS's grammar and distinguished by the
discourse analysis module (DISCO). The most frequent is the request, a command to retrieve
documents by keyword expression (e.g., "Give me news about Ford"). The next most frequent is
a selection, a statement which identifies general semantic areas or attributes of documents of
interest (e.g., "Select from stories about politicians"). These two functions, formulating a
keyword retrieval expression and selecting a semantic area, can be performed simultaneously
by sentences of type request-and-select (e.g., "Give me news on Ford, the politician"). The
grammar contains the information that some words (e.g., "politicians" or "politician") are cues for
menus of keyword expressions (e.g., the menu SPOLITICIAN includes "Ford", "Rockefeller", etc.).
Recognition of such a cue causes generation of goals for finding words or phrases from the
associated menu. If strong enough, these goals are respelled (enumerated) into their
constituent keyword expressions by SASS. The word hypothesizer (POMOW) is sensitive to
such goals and will attempt to generate supporting word candidates in the appropriate time
region.

Once a menu is selected, the speaker may ask for the contents of the current menu (e.g.,
"What are the keywords?") and DISCO will enumerate them. At any time, the user may ask for
help (e.g., "Help" or "What are the topics?") and DISCO will describe the various content areas.
Finally, DISCO will sometimes ask questions (e.g., "Do you want to see other stories on Ford?")
with an expectation for yes or no type responses.

The typical conversational flow is represented by a probabilistic finite state automaton
which, along with a list of menus, cues, and contents, is input to DISCO at initialization. Each
time SASS parses an utterance, its type and semantic content are passed to DISCO. The
appropriate response is determined and the state of the conversational model is updated.
DISCO performs whatever action is indicated, such as reporting the number of articles matching
the keyword expression, printing an article, supplying helpful information, or asking a clarifying
question. DISCO passes to SASS its predictions about which type of utterance will follow and
which semantic classes are likely to be mentioned. SASS establishes these predictions as
phrasal goals on the blackboard. During processing of the subsequent utterance, SASS and
POMOW will exhibit a bias toward hypothesizing words and phrases which are consistent with
the phrasal goals. Such semantic subselection of words predicted top-down can facilitate
recognition of content expressions in contexts where many different words are syntactically
permissible. If the subsequent utterance violates the expectations of the conversational model,
successful recognition will not be precluded but will probably be somewhat slowed because of
unfavorable scheduling competition from knowledge source activities attempting to realize those
expectations.
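
As a rough illustration of how a probabilistic finite state automaton can
yield predictions about the next utterance type, consider the sketch below.
The states, utterance type names, probabilities, and function names are all
hypothetical; the actual DISCO automaton and its seven intention types are
not reproduced here.

```python
# Hypothetical conversational model: P(next utterance type | current state).
# Here each state is simply named after the last utterance type recognized.
TRANSITIONS = {
    "start":     {"selection": 0.5, "request": 0.4, "help": 0.1},
    "selection": {"request": 0.7, "list-keywords": 0.2, "help": 0.1},
    "request":   {"request": 0.5, "yes-no": 0.3, "selection": 0.2},
}


def predict(state, min_prob=0.25):
    """Return (utterance_type, probability) pairs likely to follow `state`,
    most probable first, dropping those below `min_prob`."""
    options = TRANSITIONS.get(state, {})
    ranked = sorted(options.items(), key=lambda item: item[1], reverse=True)
    return [(kind, p) for kind, p in ranked if p >= min_prob]


def advance(state, utterance_type):
    """Update the conversational state after an utterance is parsed.
    An utterance type with no state of its own leaves the state unchanged,
    so an unexpected utterance is tolerated rather than rejected."""
    return utterance_type if utterance_type in TRANSITIONS else state


state = advance("start", "selection")
print(predict(state))  # predictions that could be posted as phrasal goals
```

In this sketch the predictions only rank what is expected; nothing prevents
an utterance outside the list from being recognized, which mirrors the
behavior described above.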
Introduction

The development of a parametric level knowledge source for Hearsay II has been
closely tied to other research reported by Goldberg [GOL75]. In that work, a number of
design choices for the initial signal-to-symbol transformation were investigated within the
framework of straightforward segmentation and labeling algorithms. The segmentation and
labeling programs in that study were developed to be relatively independent of such a
design dimension as input parametric representation. They rely rather heavily upon
empirical knowledge about the parametric (acoustic) nature of phonetic phenomena, and
upon some basic methods of statistical pattern classification. Good performance was
achieved with these algorithms, and some of the design choices were shown to be
irrelevant to the goal of achieving accurate machine transcription of connected speech at
the acoustic-segmental (sub-phonemic) level of representation. This working paper
consists of a brief summary of the comparative performance evaluations reported by
Goldberg [GOL75] and of the particular configuration currently being used as a front end
transcriber for Hearsay II.

Background 

Although most of our knowledge about how to recognize and understand speech is
taken from human performance, the structure of computer speech understanding systems
and speech recognition schemes is of particular importance to this study. Knowledge
about speech is generally organized into separate sources of knowledge, which work with
a representation of the information content of the input utterance. These representations
may exist at a number of different levels, as suggested by their elements: speech sounds,
phonetic gestures, phonemes, syllables, words, syntactic units, concepts, etc. In evaluating
the performance of recognition processes at the parametric representation level, we
eliminate, as much as possible, the effects of ambiguities from other levels. Such
ambiguities as coarticulation or phonological variation will strongly affect the degree to
which the expected transcription of an utterance corresponds with the acoustic
performance. A great difficulty in comparing published results is that the level of the
knowledge used in recognition and the representation used for evaluation are not usually
specified. Usually, only total system performance may be compared, not the effectiveness
of component methods.


† This research was supported in part by the Defense Advanced Research Projects
Agency under contract no. F44620-73-C-0074 and monitored by the Air Force Office of
Scientific Research.


Parametric  Representations 

Parametric representations fall into a few major types, and typical examples of each
have been chosen for study. A bank of broad bandwidth filters (ZCC) with amplitude and
zero-crossing measurements, and a bank of narrow band filters (ASA), amplitude only,
represent analog methods. A digital Fourier transform of the LPC filter [MAR72] produces
a smoothed spectral envelope (SPG) very much in current use. Finally, the autocorrelation
sequence (ACS) is employed with a special method designed for it [ITA75]. Each method
yields a set of measurements at uniform, short intervals -- a pattern.


Distance  Metrics 

Distance  functions,  chosen  from  Pattern  Classification  theory,  are  then  applied  to  the 
parameter  patterns  as  measures  of  acoustic  similarity.  The  basic  model  adopted  from  that 
theory  is  that  of  a vector  of  parametric  measurements  for  each  pattern.  These  vectors 
define  a space  of  possible  patterns,  within  which  a measure  of  distance  may  be  applied 
between  patterns.  As  populations  of  sample  patterns  are  accumulated,  better  statistical 
descriptions  may  be  estimated  of  the  true  distribution  of  those  patterns  in  the  space.  A 
simple example might be to collect all the occurrences of a phone, and compute the mean
and variance of each dimension. Then a suitable measure of similarity might be Euclidean
distance, weighted by variance, to approximate a measure of the deviation from population
mean. This is one distance metric chosen (SIG). The others are Euclidean distance (EUC),
Correlation (COR) -- the magnitude normalized dot product, and Maximum Likelihood (LIK).
In this last, the population covariance matrix is used to calculate Pr{unknown produced
from population}, under the assumption of Gaussian distributions.
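
A minimal sketch of three of these measures follows, assuming each pattern is
a plain list of numbers of equal length. The function names are chosen here
for illustration and the exact normalizations used in the study are not
given in the text, so this should be read as one plausible formulation. The
Maximum Likelihood measure (LIK), which needs the full covariance matrix, is
omitted.

```python
import math


def population_stats(samples):
    """Per-dimension mean and (population) variance of a list of patterns."""
    n = len(samples)
    dims = list(zip(*samples))
    mean = [sum(d) / n for d in dims]
    var = [sum((v - m) ** 2 for v in d) / n for d, m in zip(dims, mean)]
    return mean, var


def euc(x, y):
    """EUC: plain Euclidean distance between two patterns."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def sig(x, mean, var):
    """SIG: Euclidean distance from the population mean, with each
    dimension weighted by the inverse of its variance.
    Assumes every variance is non-zero."""
    return math.sqrt(sum((a - m) ** 2 / v for a, m, v in zip(x, mean, var)))


def cor(x, y):
    """COR: magnitude-normalized dot product. This is a similarity
    (1.0 for patterns pointing the same way), not a distance."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm


mean, var = population_stats([[1, 2], [3, 6]])
print(euc([0, 0], [3, 4]), sig([3, 6], mean, var), cor([1, 0], [2, 0]))
```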

Segmentation 

A method  for  segmenting  speech  into  isolated,  acoustically  consistent  segments  is 
presented.  The  method  is  fairly  independent  of  the  choice  of  parametric  representation, 
since  it  relies  upon  the  acoustic  similarity  measure  as  the  primary  evidence  of  acoustic 
change.  First,  however,  a threshold  is  applied  to  the  signal  amplitude  measurement  to 
detect  speech/silence.  Then  the  speech  portion  is  examined  further.  In  collecting 
evidence  for  a segment  boundary,  a measure  of  change  is  applied  to  neighboring 
parameter  patterns.  This  produces  a time  sequence  of  values,  whose  peaks  are  detected 
and  subjected  to  a threshold  for  acceptance  or  rejection.  A composite  of  such  functions 
yields  the  final  segmentation.  Narrow  and  broad  pattern  similarity,  as  well  as  amplitude 
change  are  the  three  functions  applied  during  speech  portions  of  the  signal.  This  process 
is very much like the process hypothesized in the basic model for Signal Detection
[EGA64]. That model may be applied to the problem of evaluating segment boundary
"detectability."

Missing and extra segment errors are found to be as good as 4% and 19%,
respectively. Significant differences in the segmentation effectiveness of the parametric
representations are found. They may be ordered as follows: SPG, ACS, ASA, and ZCC. The
best performance is found to be comparable to the state of the art. Little reduction in
accuracy is encountered when new speakers are tested.
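
The boundary detection procedure described in this section (a speech/silence
amplitude threshold, a change measure over neighboring patterns, then peak
picking with a threshold) might be sketched as follows. Only a single change
function is shown, whereas the method described combines three, and the
function names, parameters, and threshold values are assumptions made for
illustration.

```python
import math


def change_curve(patterns):
    """Distance between each pair of neighboring parameter patterns."""
    return [math.dist(a, b) for a, b in zip(patterns, patterns[1:])]


def find_boundaries(patterns, amplitudes, silence_level, peak_threshold):
    """Return frame indices at which a segment boundary is hypothesized.

    A boundary is placed before frame i + 1 when the change between
    frames i and i + 1 is a local peak, the peak reaches `peak_threshold`,
    and both frames are above the silence amplitude level.
    """
    curve = change_curve(patterns)
    boundaries = []
    for i in range(1, len(curve) - 1):
        is_speech = (amplitudes[i] > silence_level
                     and amplitudes[i + 1] > silence_level)
        is_peak = curve[i] > curve[i - 1] and curve[i] >= curve[i + 1]
        if is_speech and is_peak and curve[i] >= peak_threshold:
            boundaries.append(i + 1)
    return boundaries


frames = [[0], [0], [0], [5], [5], [5]]  # one abrupt acoustic change
print(find_boundaries(frames, [10] * 6, silence_level=1, peak_threshold=1))
```

Lowering `peak_threshold` in such a scheme trades missing segments for extra
ones, which is the trade-off the error rates above describe.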

Table 1 shows the results of segmentation for 40 sentences from the News Retrieval
task, one speaker. The reference segmentation contains 1082 segments primarily at the
phonemic level of description. The second reference contains corrections to this file, to
make it more an acoustic description of the corpus. It has 1541 segments. The number of
machine segments reported may be greater than the sum of this number (hand reported
acoustic segments) and the number of extra boundaries. The discrepancy is merely the
result of the way we evaluate segmentation by boundaries. Occasionally, two machine
boundaries will fall close enough to a hand boundary to both be accepted. Such segments
must, therefore, be very short, and are usually transition segments which may be easily
detected at higher levels. The number of missing boundaries (segments), divided by the
number of boundaries which are included in both reference segmentations, is the missing
segment error rate. The number of shifted boundaries is also divided by this number. The
number of extra boundaries is divided by the number of primary segments (the size of H1
in this case). The extra segment rates in parentheses are those where division is by the
number of acoustic segments (size of H2). The value, d', is a single measure of
detectability from the Signal Detection model. It has the effect of normalizing for the
trade-off between missing and extra segment errors.
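
The error rates defined above, together with the standard signal detection
definition of d' (the difference between the normal deviates of the hit rate
and the false alarm rate), can be sketched as follows. How the study maps
missing and extra segment rates onto hit and false alarm rates is not stated
in the text, so the mapping in the usage lines is an assumption, and the
counts are made-up illustrative values rather than data from Table 1.

```python
from statistics import NormalDist


def error_rates(missing, shifted, extra, common_boundaries, primary_segments):
    """Error rates as defined in the text: missing and shifted boundaries are
    divided by the number of boundaries common to both references, extra
    boundaries by the number of primary segments."""
    return {
        "missing": missing / common_boundaries,
        "shifted": shifted / common_boundaries,
        "extra": extra / primary_segments,
    }


def d_prime(hit_rate, false_alarm_rate):
    """Standard detectability index: z(hit rate) - z(false alarm rate).
    Both rates must lie strictly between 0 and 1."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)


# Illustrative counts only (not taken from the study).
rates = error_rates(missing=30, shifted=50, extra=200,
                    common_boundaries=1000, primary_segments=1000)
# Assumed mapping: a hit is a boundary that was not missed, and the
# extra segment rate plays the role of the false alarm rate.
print(d_prime(1 - rates["missing"], rates["extra"]))
```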
Table 1

  #      %      d'
  28     2.8    2.38  (2.65)
  34     3.4    1.93  (2.24)
  45            1.58  (1.91)
  41     4.1    1.29  (1.77)
Labeling is accomplished by the same pattern similarity distance metrics. Given a
set of phonetic elements as the recognition targets, a set of templates for each target is
collected from the training data. This is achieved by a clustering algorithm developed with
the purpose in mind of encoding some of the ambiguities encountered in phone
performance into the set of templates. The pairwise distances are computed for all pairs
of sample patterns in the training population for a particular phonetic target. Then a
threshold is chosen from these values, and the distances below threshold are marked. The
sample pattern in the most marked pairs is chosen as a representative template and all its
marked mates are discarded. After iterating, the population is divided into clusters of
various sizes, each with a "best" representative template pattern. Clusters of sufficiently
small size are ignored.
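
The template selection procedure just described can be sketched in a few
lines. Euclidean distance, the tie-breaking rule, the function name, and the
minimum cluster size are assumptions made here; the text does not specify how
the threshold is chosen, so it is passed in as a parameter.

```python
import math


def select_templates(samples, threshold, min_cluster_size=2):
    """Greedy clustering of the training patterns for one phonetic target.

    Repeatedly picks the remaining sample that takes part in the most
    "marked" pairs (pairs closer than `threshold`), makes it the template
    of a cluster containing its marked mates, and removes the cluster.
    Returns a list of (template, members) pairs, dropping small clusters.
    """
    remaining = set(range(len(samples)))
    clusters = []
    while remaining:
        mates = {
            i: {j for j in remaining
                if j != i and math.dist(samples[i], samples[j]) < threshold}
            for i in remaining
        }
        # Ties are broken in favor of the lowest index (an arbitrary choice).
        best = max(sorted(remaining), key=lambda i: len(mates[i]))
        members = mates[best] | {best}
        clusters.append((best, sorted(members)))
        remaining -= members
    return [(samples[rep], [samples[m] for m in members])
            for rep, members in clusters
            if len(members) >= min_cluster_size]


patterns = [[0], [1], [2], [50], [51], [90]]
for template, members in select_templates(patterns, threshold=1.5):
    print(template, members)
```

In this toy run the isolated pattern `[90]` forms a cluster of size one and
is ignored, as the last step of the description suggests.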

Labeling  itself  proceeds  by  computing  the  distance  from  the  unknown  pattern  to 
each  template.  In  addition  to  the  distance  metrics  mentioned,  three  prosodic  features  of 
each  segment  --  the  average  amplitude,  the  duration,  and  the  amplitude  contour  of  the 
surrounding  segments  --  are  used  to  increase  the  distances  to  templates  whose  prosodic 
features  were  considerably  different. 

The set of templates (and their appropriate target labels) and the distance scores
give the total recognition information available from this straightforward labeler. If some
criterion  is  placed  on  the  templates  which  one  is  willing  to  report  to  the  rest  of  a system, 
then  accuracy  may  be  measured  as  a function  of  the  severity  or  looseness  of  that 
criterion.  If  the  true  effect,  upon  a speech  recognition  system,  of  loosening  the 
acceptance  criterion  is  to  be  understood,  one  must  also  measure  the  expected  number  of 
separate  targets  reported  at  each  instance.  We  call  this  the  Branching  Factor,  and  collect 
it  as  well  as  accuracy  statistics  in  evaluating  labeling  performance. 
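
A sketch of how accuracy and Branching Factor might be collected together is
given below. It assumes Euclidean distance, omits the prosodic adjustments
mentioned above, and uses a hypothetical acceptance criterion (a maximum
number of templates and an optional maximum distance); the labels and
vectors in the usage lines are invented for illustration.

```python
import math


def label_segment(pattern, templates, max_templates=5,
                  max_distance=float("inf")):
    """Return up to `max_templates` (distance, label) pairs, closest first.
    `templates` is a list of (label, vector) pairs."""
    scored = sorted((math.dist(pattern, vec), label)
                    for label, vec in templates)
    return [(d, label) for d, label in scored[:max_templates]
            if d <= max_distance]


def evaluate(segments, templates, **criterion):
    """Return (accuracy, branching_factor) over labeled test segments.

    accuracy         -- fraction of segments whose correct label is reported
    branching_factor -- mean number of distinct labels reported per segment
    `segments` is a list of (pattern, correct_label) pairs.
    """
    hits = 0
    distinct = 0
    for pattern, truth in segments:
        labels = {label for _, label in
                  label_segment(pattern, templates, **criterion)}
        hits += truth in labels
        distinct += len(labels)
    return hits / len(segments), distinct / len(segments)


templates = [("IY", [0, 0]), ("IY", [1, 0]), ("AA", [10, 0]), ("S", [0, 10])]
test_set = [([0.5, 0], "IY"), ([9, 0], "AA"), ([4, 0], "AA")]
print(evaluate(test_set, templates, max_templates=1))  # strict criterion
print(evaluate(test_set, templates, max_templates=3))  # looser criterion
```

Loosening the criterion in this toy example raises accuracy but also raises
the Branching Factor, which is the cost passed on to higher levels.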

Little  difference  is  observed  along  the  parametric  representation  or  the 
classification  metric  dimensions,  except  for  poorer  performance  for  ZCC  input.  Each  input 
segment  is  labeled  as  one  of  a set  of  40  phone  labels.  The  correct  phone  appears  as  the 
first choice 28% of the time. It appears in the first three choices 55% of the time.
However, when a lower level, acoustic transcription is used as evaluation referent, these
values increase to 42% and 65%. Even the 28% accuracy, which arises from a comparison
against phonemic expectation, is acceptable performance. It is the same as or slightly
better than human spectrogram reading performance in the absence of other linguistic
clues [SHO74].

Table 2 shows overall labeling accuracies for the four parametric representations.

p    SPG         ASA         ZCC         ACS

1    24.6(1.0)   27.1(1.0)   20.3(1.0)   28.7(1.0)

2    42.4(1.3)   39.1(1.9)   31.4(1.9)   44.4(1.9)

3    54.0(2.8)   50.4(2.8)   42.0(2.8)   54.6(2.7)

Table 2: Parametric Representation Dimension


The distance metric is the Euclidean distance function†, and a set of 40 phonetic
recognition targets is used. The values reported for position p are Pr{correct template in
position ≤ p}. The expected number of different targets (branching factor) is given in
parentheses.

Figure  1 is  a graphic  display  of  accuracy  versus  Branching  Factor  for  the  SPG/SIG 
experiment.  Five  plots  are  given,  identified  by  the  size  of  the  target  set  used  in  each 
evaluation.  The  BF  plot  gives  a particularly  convenient  view  of  accuracy  against  the 
demands  that  will  be  made  upon  higher  levels  by  excess  options  in  recognition. 


Contributions 

The major contributions of this research are in three areas. First, a clear picture is
provided of the segmentation and labeling performances available from standard
parametric representations. An ordering can be made of the representations, for
segmentation effectiveness, which agrees fairly well with existing beliefs about the
information content of these representations. The fine spectral parameters seem to
contain more of the relevant information about speech. Labeling performance is
counter-intuitive, in that it does not seem to matter which representation is chosen. However, this
agrees well with the rapidly growing belief that the bulk of the error-inducing ambiguities
of human speech are of a higher level nature.


† The ACS representation was run only with Itakura's log ratio distance.

In  addition,  methods  are  presented  for  segmenting  and  labeling  speech  in  a 
straightforward,  parameter-independent  manner.  These  methods  perform  well,  for  their 
simplicity,  when  compared  with  both  human  and  machine  results  at  the  same  low  level. 
They  are  not  meant  as  a substitute  for  higher  level  knowledge  sources,  but  rather  as 
available  tools  or  lower  bounds  on  acceptable  performance  for  the  initial  signal-to-symbol 
analyses. A methodology for evaluating performance, which is closely allied to our view of
the separability of the levels of representation, is offered. A parameter, d', of the signal
detection model appears useful as a broad estimate of behavior.
Finally, a viewpoint is put forth concerning the role of pattern classification
techniques in recognition processes of this level. It is our belief that more serious
application of these methods - particularly of methods for training which respond to the
empirical data, rather than to a priori ideas about speech - will yield considerable results
in the near future.

Application  to  Hearsay  II 

The  current  configuration  of  Hearsay  II  has  segmentation  and  labeling  of  the  input 
parametric  representation  performed  by  a separately  running  program.  Hence,  there  are 
no  interactions  with  the  other  knowledge  sources.  The  parametric  representation  used  for 
most of the Hearsay II research is the LPC spectrum (SPG). Typically, segmentation
knowledge acquired by training analysis of one male speaker is used for all other male
speakers with no significant degradation of performance. Labeling knowledge is more
speaker specific, and training is performed for each speaker, producing a set of about 100
templates for approximately 40 commonly occurring phones. Euclidean (EUC) or Euclidean
with variance (SIG) distance metrics are used to provide a set of 5 templates (fewer if
there are not enough templates close to the input pattern) with which each machine
segment is labeled.

The resultant transcription is intended to represent the utterance at an extremely
low level. Segmentation is tuned to miss as few segments as possible, and therefore
yields a large number of segments which may be considered extra. (These extra segments
are very often indications of transition segments between phones or of regions of
intra-phone variation.) The multiple template labeling information is also designed to omit as
little possibly relevant information as possible. Hence, a major task of one knowledge
source in Hearsay II† is to consolidate and select from this input with the aid of speech
specific knowledge at the phonetic and phonological levels.


† See Shockey and Adam's paper in this collection concerning PSYN.
I. INTRODUCTION

A central problem for speech understanding systems consists of efficiently and
accurately determining what words at the lexical level are implied by the data at lower
levels. As speech systems permit larger vocabularies and languages with less restricted
syntax and semantics, they must depend more on bottom-up methods to limit the search
space of possible word sequences. A bottom-up word hypothesizer is presented which
uses classes of syllables (called SYLTYPES) to support word hypotheses. A Markov
probability model relates a lower level representation (in this case phones^) to a
sequence of states defining a SYLTYPE. Words are suggested by these SYLTYPES (using
an inverted lexicon) and possibly pruned (for multisyllabic words) depending on adjacent
SYLTYPES. The definition of SYLTYPES is made, based on the training data and the word
vocabulary, to optimize the word hypothesization accuracy. The method, as implemented
in POMOW for the Hearsay II speech system, is shown to reduce the search space
effectively for a vocabulary of 1000 words.

In this working paper we will cover five topics: 1) background on the problem of
word hypothesization, 2) supporting ideas, 3) implementation of these ideas for the HSII
system, 4) results to date, and 5) problems to investigate. The work discussed here is in
a changing state. Some of the methods have been chosen because they "seem right" and
have not been tested thoroughly. Other methods have been selected for their ease of
implementation and may yield to better ideas.

^ This  research  was  supported  in  part  by  the  Defense  Advanced  Research  Projects 
Agency  under  contract  no.  F44620-73-C-0074  and  monitored  by  the  Air  Force  Office  of 
Scientific  Research. 

^In this paper we will use "phone" to refer to a sound detected and classified by a
program and "phoneme" for the expectation of that sound as entered in a word
pronunciation dictionary.
II. PROBLEM BACKGROUND

A central problem for speech understanding systems consists of efficiently and
accurately extracting information from different sources of knowledge and applying it to
determine what words at the lexical level are best represented in the data at lower
levels (e.g., phones, acoustic segments, parameters).

Two  types  of  strategies  for  applying  this  information  are  top-down  strategies  and 
bottom-up  strategies.  Top-down  strategies  (also  called  "analysis  by  synthesis") 
determine  which  is  the  best  word  at  the  lexical  level  for  some  region  of  time  by 
transforming  all  possible  words  into  the  representation  of  a lower  level,  comparing  each 
word  at  that  level,  scoring  each  word,  and  choosing  the  word  with  the  highest  score. 
These strategies are feasible only in task domains which have small vocabularies and
strong syntactic and semantic constraints, and are found in most first generation speech
systems. The early Lincoln, SDC, SRI, Sperry Univac and HEARSAY I systems are
examples. However, when vocabularies become large and syntactic and semantic
constraints  become  weak,  top-down  strategies  are  lost  in  the  combinatorics  of  potential 
word  sequences.  The  Dragon  speech  system  (Baker  1975)  must  be  classified  as  using 
both  a top-down  strategy  and  a bottom-up  strategy  since  it  essentially  does  both  at 
once.  The  Dragon  system  does  have  the  same  sensitivity  to  word  vocabulary  size  that 
top-down  strategies  have. 

Bottom-up strategies, as in POMOW, attempt to infer from the information at the
lower level what the words are at the lexical level. The ideal bottom-up strategy would
propose only one word at the lexical level (later verified to be correct) for each
word-sized region in the lower level representation. Size of vocabulary and degree of
syntactic and semantic constraints might affect the efficiency, but not the accuracy, of
this ideal bottom-up strategy. It would have to collect and use the right information at
the lower level to point to the correct word and only the correct word for each
word-sized region of the utterance. Of course the errorful and ambiguous nature of speech
prevents such an ideal bottom-up strategy, but we can try to approach this ideal. That
is, we can limit the number of proposed words per word-sized region without usually
excluding the correct word, by choosing and using the most effective information from
the lower level.

The information we use to support word hypotheses and how we group this
information into units define what we will call "units of support". Three possible criteria
for good "units of support" might be:

1. Robustness - affected little by surrounding speech.

2. Few units relative to word vocabulary size.

3. Informationally rich - each unit proposes few words.

If a unit of support is robust it will reliably indicate the presence of a word or a
set of possible words in various contexts and under different speaker conditions. The
second criterion prevents the original combinatoric problem of hypothesizing words from
being replaced with the same problem of hypothesizing these support units. The last
criterion points us in the direction of the ideal bottom-up hypothesizer. Unfortunately
criterion one seems to be weak at all levels of representation in speech and criteria two
and three tend to vary inversely. By choosing "units of support" we are introducing a
new level of representation between the word level and the lower level representation.
We want to optimize these criteria to give the most efficient and accurate bottom-up
strategy.

Speech  understanding  systems  which  have  included  a bottom-up  strategy  have 
used  various  "units  of  support".  One  (rarely  used)  version  of  Hearsay  I (Reddy,  Erman, 
and  Neely  1973)  used  gross  features  (such  as  the  "SH"  in  "Bishop")  to  retrieve  those 
words  from  the  lexicon  that  included  the  particular  feature  present  in  the  utterance.  The 
1974  BBN  Speechlis  system  (Rovner,  et  al.  1974)  used  any  pair  of  adjacent  phones  as  a 
unit of support for its bottom-up strategy. In each phone-sized segment several
alternative  phonemes  might  be  hypothesized.  Each  unique  pair  of  hypothesized 
phonemes  in  every  pair  of  adjacent  segments  is  used  to  retrieve  the  set  of  words  from 
the  lexicon  containing  that  pair  of  phonemes. 

III.  SUPPORTING  IDEAS 

What  might  be  the  best  units  of  support  for  a bottom-up  strategy?  Gross  features 
have  few  units  and  tend  to  be  robust  but  are  not  informationally  rich.  Phoneme  pairs 
may  contain  more  information  but  cannot  be  considered  robust.  Two  different  ideas 
influenced  the  selection  of  the  type  of  units  of  support  for  POMOW.  The  first  is  from 
Fujimura  (1974)  who  proposes  that  the  syllable,  phonologically  redefined,  serve  as  the 
effective  minimal  unit  of  speech  recognition.  Though  he  is  primarily  concerned  with  their 
use  in  a top-down  strategy  of  template  matching,  his  reasons  for  using  the  syllable  also 
apply  to  bottom-up  strategies.  He  defines  an  ordering  of  consonantal  elements  according 
to their "vowel affinity" (vowels having maximum affinity). A syllable is then the segment
from one minimum in the affinity contour of the utterance to the next minimum. He
argues that syllables, and especially stressed syllables, are more robust than phonemes.
Coarticulatory effects across a syllable boundary, when such a boundary is definable, are
much smaller and more easily handled with phonological rules than are the effects among
phonemes within a syllable. It seems that a syllable contains enough information so that
any syllable would propose only a small fraction of the word vocabulary. A possible
weakness of using the syllable as a unit of support is the number of units. The number
of syllables, as Fujimura defines them, is on the order of a few thousand. We can limit
this number, and unfortunately decrease the information content of each, by putting the
syllables into equivalence classes; we call these syltypes.

The definition of syltypes was also influenced by another source. Baker's (1975)
use  of  a Markov  probability  model  in  his  Dragon  speech  system  encouraged  us  to  define 
the  syltypes  by  a state-sequence  and  use  a similar  model  for  deriving  them  from  the 
underlying  phone  sequence. 

There  are  two  extremes  a state  sequence  definition  of  syltypes  could  take  which 
will  produce  the  same  number  of  units.  The  first  is  to  form  a syltype  from  a short 
sequence  of  many  different  states.  For  example,  we  could  use  a sequence  given  by  a 
prenucleus  state  derived  at  a low  amplitude  point,  a nucleus  state  at  a local  high 
amplitude  point,  and  a postnucleus  state  at  the  next  low  amplitude  point.  The  second 
extreme  lies  in  the  direction  of  doing  more  segmentation  to  produce  a longer  sequence  of 
states with relatively fewer different states. POMOW's definition of syltypes tends
toward  this  second  extreme  with  segmentation  being  on  the  order  of  phone  lengths  and 
the  states  being  equivalence  classes  of  phonemes.  The  definition  of  syltypes  can  be 
easily  changed  within  this  framework  by  redefining  phoneme  equivalence  classes  for  the 
phonemes  and  specifying  legal  state  transitions.  The  decision  to  use  this  framework  has 
not  been  completely  analyzed  and  was  originally  made  because  POMOW  was  to  use  phone 
hypotheses  as  input.  This  is  not  necessary.  In  fact  one  version  of  POMOW  has  used 
lower  level  acoustic  segments  as  input  together  with  the  same  definition  of  syltypes  as 
was  used  with  phone  input. 
A-LIKE:  AE, AA, AH, AO, AX
I-LIKE:  IY, IH, EY, EH, IX, AY
U-LIKE:  OW, UH, U, UW, ER, AW, OY, EL, EM, EN
LIQUID:  Y, W, R, L
NASAL:   M, N, NX
STOP:    P, T, K, B, D, G, DX
FRIC:    HH, F, TH, S, SH, V, DH, Z, ZH, CH, JH, WH

Figure 1: Phoneme Equivalence Classes.


The current definition of syltypes is based on grouping the phonemes into seven
classes: A-like, I-like, and U-like vowels, liquids, nasals, stops, and fricatives. Figure 1
gives the class membership for the phonemes. Each class contains two states depending
on which side of the syllable nucleus the phoneme appears (e.g., phoneme "T" is mapped
to a Stop!left state if it precedes the syllable nucleus). Vowels are also mapped to left
and right states. Typical state transitions, found in a 275 word vocabulary, are described
by the network given in Figure 2. For example, let the above phoneme classes be
represented by the symbols A, I, U, L, N, P, and F respectively. The word "AIRPLANES", with
the pronunciation <EH R> <P L EY N Z>, is mapped into the syltypes IL and PLINF.
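
The class-symbol mapping just described can be sketched in a few lines of Python. This is an illustration only, not POMOW's code: the table holds the Figure 1 classes as far as they can be read in this copy (one unclear U-like entry is omitted), and the names CLASSES and syltype are hypothetical. The left/right state distinction is ignored because the one-letter symbols do not encode it.

```python
# Phoneme equivalence classes, keyed by the one-letter symbols used in the text.
# Membership follows Figure 1 as far as it is legible (an assumption).
CLASSES = {
    "A": ["AE", "AA", "AH", "AO", "AX"],
    "I": ["IY", "IH", "EY", "EH", "IX", "AY"],
    "U": ["OW", "UH", "UW", "ER", "AW", "OY", "EL", "EM", "EN"],
    "L": ["Y", "W", "R", "L"],
    "N": ["M", "N", "NX"],
    "P": ["P", "T", "K", "B", "D", "G", "DX"],
    "F": ["HH", "F", "TH", "S", "SH", "V", "DH", "Z", "ZH", "CH", "JH", "WH"],
}

# Invert the table: phoneme -> class symbol.
PHONEME_TO_CLASS = {p: c for c, members in CLASSES.items() for p in members}


def syltype(syllable):
    """Map a space-separated phoneme string for one syllable to its syltype."""
    return "".join(PHONEME_TO_CLASS[p] for p in syllable.split())


# The AIRPLANES example from the text: <EH R> <P L EY N Z>.
airplanes = [syltype(s) for s in ["EH R", "P L EY N Z"]]
```

With this table, airplanes evaluates to the two syltypes named in the text, IL and PLINF.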


Figure  2:  Syltype  State  Network 

How many syltypes does this definition permit? Even with preventing cycles
between the Stop and Fricative states (for both the left and right parts of the network),
there are more than 900 syltypes. However, the number that are encountered in a fixed
vocabulary is usually smaller. The graph in Figure 3 of word vocabulary size versus
number of unique syltypes (for five vocabularies we have used) shows that a vocabulary
of 1000 words has about 250 syltypes. The syltypes are informationally rich since on
the average each will indicate the presence of only a small fraction (2%) of the
vocabulary. In the worst case, a particular syltype occurs in 10% of the vocabulary.
Figure 4 is a sample from a syltype lexicon.

Figure 3: Words versus Syltypes


Syltype  Stress  Word

AN       1       (illegible)
         0       AMERICAN
         1       AND
         0       MUSEUM
         1       ON
         1       ONTARIO
         0       VETERAN
ANP      1       AND
ANF      1       NEW-ORLEANS
AL       1       ALBERTA
         2       ALCOHOL
         1       ALCOHOL
         1       ALL
         1       ARCARO
         1       ARE
         2       ARPA

Figure 4: Sample from a Syltype Lexicon.


How  should  we  calculate  the  probability  of  taking  a particular  path  through  the 
syltype  state  network  given  a sequence  of  phones?  That  is,  how  do  we  assign 
probabilities  to  different  syltypes?  Let  Y[1:Q]  be  a sequence  of  phones  and  X[1:Q]  be  a 
sequence of syltype states forming some syltype. We want to find the probability of
X[1:Q] given that the phone sequence Y[1:Q] has been detected. This can be calculated
by:

Eq(1)
\[ P(X[1:Q]=x[1:Q] \mid Y[1:Q]=y[1:Q]) = \prod_{q=1}^{Q} P(X(q)=x(q) \mid X[1:q-1]=x[1:q-1],\; Y[1:q]=y[1:q]). \]

Using the Markov assumption that

\[ P(X(q)=x(q) \mid X[1:q-1]=x[1:q-1]) = P(X(q)=x(q) \mid X(q-1)=x(q-1)) \]

and the assumption that the information in the sequence y[1:q-1] is sufficiently encoded
in the state x(q-1) we have:

Eq(2)
\[ P(X[1:Q]=x[1:Q] \mid Y[1:Q]=y[1:Q]) = \prod_{q=1}^{Q} P(X(q)=x(q) \mid X(q-1)=x(q-1),\; Y(q)=y(q)). \]

For q=1 we use P( X(1)=x(1) | Y(1)=y(1) ) in the product on the right hand side of
Eq(2). The Dragon Speech System uses a similar equation and employs a dynamic
programming method to find the word sequence with the highest probability. However,
the goals and restrictions of Dragon differ from those of POMOW. Dragon must determine
which one utterance was spoken, so choosing the most probable path in its state network is the
best decision. But POMOW has limited knowledge about speech and must work together
with other knowledge sources (KSs). It should find a set of most probable paths in its
network, i.e., syltypes, each rated so that another KS can decide which one is best.
Dragon is restricted from doing this computationally by the number of states needed to
describe the whole utterance. With only 16 states (the number of nodes in Figure 2),
POMOW can afford to investigate many paths.
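
As an illustration of rating several paths rather than only the best one, the following Python sketch scores state sequences with the product in Eq(2) and keeps the n most probable. It is a minimal sketch under stated assumptions, not POMOW's implementation: the function names, the table layout (keys of the form (previous state, phone, new state)), and all probabilities are hypothetical, and the search is a plain exhaustive enumeration, which is only reasonable because the state network is small.

```python
import heapq


def path_probability(states, phones, trans, initial):
    """Eq(2): P(x(1) | y(1)) times the product of P(x(q) | x(q-1), y(q)) for q > 1."""
    p = initial.get((states[0], phones[0]), 0.0)
    for q in range(1, len(states)):
        p *= trans.get((states[q - 1], phones[q], states[q]), 0.0)
    return p


def best_paths(phones, trans, initial, n=3):
    """Enumerate every state sequence the tables allow and return the n most probable."""
    paths = [(p, (s,)) for (s, y), p in initial.items() if y == phones[0]]
    for y in phones[1:]:
        paths = [
            (p * t, path + (new,))
            for p, path in paths
            for (prev, phone, new), t in trans.items()
            if prev == path[-1] and phone == y
        ]
    return heapq.nlargest(n, paths)


# Toy tables with made-up probabilities (hypothetical, for illustration only).
INITIAL = {("I", "EH"): 0.9, ("A", "EH"): 0.1}
TRANS = {("I", "R", "L"): 0.8, ("I", "R", "N"): 0.2, ("A", "R", "L"): 0.5}

ranked = best_paths(["EH", "R"], TRANS, INITIAL, n=2)
```

Each entry of ranked is a (probability, state sequence) pair, so a caller playing the role of another KS can choose among the rated alternatives.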

To give a more reliable common starting point for finding alternative syltypes, the
sequence of phones, and therefore the syltype state network, is not traced left to right
(i.e., in the direction of increasing time) but from the syltype nucleus out to the END!LEFT
state and then from the nucleus to the END!RIGHT state, as indicated by the arrows in
Figure 2. Hence X(1) will be an AVOW!LEFT, IVOW!LEFT, or UVOW!LEFT state.

Once we have syltypes supported by the observed phones, we must find the
words which best fit the syltypes. The ideas behind the hypothesization of words from
the syltypes are less sophisticated. Each hypothesized syltype suggests a set of words
to hypothesize. A particular word is included in the set because it has a syllable which
maps to the syltype and the syllable was marked in the word-phoneme dictionary
(currently by intuition) as having enough stress to reliably indicate the word's presence
in the utterance. Multisyllabic words in this set are rejected if they match poorly with
adjacent syltype hypotheses. The match uses a (yet untested) measure of the
conditional probability that a word's syltype occurred given that a particular
syltype is observed. Words not rejected are rated and put in the HSII data base for
other KS's to check.

IV. IMPLEMENTATION

The implementation consists of three programs: a module for the HSII system
containing the KS's POM and MOW (the syltype hypothesizor and the word hypothesizor,
respectively), and initialization programs for each, called POMNIT and MOWNIT. Figure 5
illustrates how these programs are connected and used. The implementation is best
described by this figure and the figures referenced by it. We will explain some of the
more obscure points.

The syntax of the word-phoneme dictionary (Figure 6) permits an AND/OR tree of
the possible phonemes in a word. Parentheses and commas indicate an OR group and
concatenation of the elements (phones or syllables) indicates an AND group. Angle
brackets are syllable boundaries, with the number after the opening bracket giving
the stress level of the syllable. (We use 0, 1, and 2 to indicate reduced, normal, or
stressed, respectively, with a default stress of normal.) Currently these stress levels are
used to indicate to MOW whether or not the word should be hypothesized by the syltype to
which the syllable will map. For example, the word "AND" will never be hypothesized
by the syltype "U" which the syllable "EN" maps to. A "∅" in any OR group (whether
composed of phonemes or syllables) indicates that the group may be absent. At present
many phonological rules have been put into this dictionary as alternate pronunciations.

MOWNIT uses the word-phoneme dictionary and the phoneme equivalence classes
to produce a syltype lexicon and a word-syltype data structure. These are basically
inversions of each other. The number with each entry in the syltype lexicon (Figure 4)
is the stress level; a second number (not shown) associated with each entry is a pointer
into the word-syltype data structure (Figure 7). The information in this structure is used
by MOW to find the syllable structure of a word and by other KS's for more detailed
verification of the word.
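
The inversion between the two structures can be sketched as follows. This is a hedged illustration, not MOWNIT itself: the dictionary layout and the name build_syltype_lexicon are assumptions, the two sample words are taken from Figure 7 as far as it is legible, and a plain syllable index stands in for the pointer mentioned above.

```python
from collections import defaultdict

# Word-syltype structure: word -> list of (syltype, stress, pronunciation),
# one tuple per syllable. Entries follow Figure 7 as read from this copy.
WORD_SYLTYPES = {
    "ABOUT": [("A", 0, "AX"), ("PUP", 2, "B AW T")],
    "AIRPLANE": [("IL", 2, "(EH,EY) R"), ("PLIN", 1, "P L EY N")],
}


def build_syltype_lexicon(word_syltypes):
    """Invert word -> syllables into syltype -> [(stress, word, syllable index)]."""
    lexicon = defaultdict(list)
    for word, syllables in word_syltypes.items():
        for index, (syltype, stress, _pron) in enumerate(syllables):
            # The index plays the role of the pointer back into the
            # word-syltype structure.
            lexicon[syltype].append((stress, word, index))
    return dict(lexicon)


lexicon = build_syltype_lexicon(WORD_SYLTYPES)
```

Looking up lexicon["PUP"] then yields the stress level, the word ABOUT, and the position of the matching syllable, from which the full syllable structure of the word can be recovered.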

POMNIT uses the phoneme equivalence classes to convert phonetically hand-labeled
utterances into state sequences. Previous results from the HSII phone
hypothesizor are aligned with these states so that frequency counts of [current state,
ABOUT        <0AX><2B AW T>
ACUPUNCTURE  <2AE K><0Y (UW,AX)><1P AX NX K><0 (T,∅) SH ER>
AGRICULTURE  <2AE G><0R (IH,IX)><1K (AX,AA) L><T SH ER>
AIRPLANE     <2(EH,EY) R><P L EY N>
AIRPLANES    <2(EH,EY) R><P L EY N Z>
AKRON        <2AE><K R (AX N,EN)>
ALBERTA      <AE L><2B ER><(DX,T) AX>
ALCOHOL      <2AE L><0K AX><(HH,∅) AA L>
ALL          <AO L>
ALLIGATOR    <2AE><0L (IH,IX)><1G EY><0 (DX,T) ER>
AMERICAN     <0 (AX M,EM)><2(M,∅) EH><0R (IH,IX)><K (IX N,EN)>
ANALYSIS     <0AX><2N AE><0L AX><S (IH,IX) S>
AND          (<0 EN>, <(AE,IX) N (D,∅)>)
ANIMALS      <2AE><N (IH,AX,EM)><M (AX L,EL) Z>
ANY          <1 (IH,EH)><0N (IY,IH)>
ARCARO       <AA R><2K EH><R (AX,OW)>

Figure 6: Sample from a Word-Phoneme Dictionary.

Word         Syltype  Stress  Pronunciation

ABOUT        A        0       AX
             PUP      2       B AW T
ACUPUNCTURE  AP       2       AE K
             LU       0       Y UW
             LA       0       Y AX
             PANP     1       P AX NX K
             PFU      0       T SH ER
AGRICULTURE  AP       2       AE G
             LI       0       R (IH,IX)
             PAL      1       K (AX,AA) L
             FU       1       T SH ER
AIRPLANE     IL       2       (EH,EY) R
             PLIN     1       P L EY N

Figure 7: Example of Word-Syltype Data Structure.

next state, next phone] triples can be made. These are normalized to give the
probability of going from a current state to a new state given the next phone.
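
The counting and normalization step can be sketched as below. This is an illustrative reading of the description, not POMNIT's code: the input format (each utterance as a list of aligned (state, phone) pairs), the function name, and the toy data are assumptions. Counts of [current state, next state, next phone] triples are divided by the number of times the same current state was followed by the same phone.

```python
from collections import Counter


def estimate_transitions(aligned_utterances):
    """Return {(current state, next state, next phone): P(next state | current state, next phone)}."""
    triple_counts = Counter()
    context_counts = Counter()
    for utterance in aligned_utterances:
        # Walk over adjacent positions of the aligned state/phone sequence.
        for (current, _), (nxt, phone) in zip(utterance, utterance[1:]):
            triple_counts[(current, nxt, phone)] += 1
            context_counts[(current, phone)] += 1
    return {
        (current, nxt, phone): count / context_counts[(current, phone)]
        for (current, nxt, phone), count in triple_counts.items()
    }


# Toy training data (hypothetical): three two-position alignments.
training = [
    [("I", "EH"), ("L", "R")],
    [("I", "IH"), ("L", "R")],
    [("I", "IY"), ("N", "R")],
]
table = estimate_transitions(training)
```

In this toy data the phone R follows state I three times, twice leading to state L, so table[("I", "L", "R")] is 2/3. A triple that never occurs in training gets no entry at all.
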
The results to date are given in Figure 8. For Test I, training and testing were done
using 20 utterances which had been phonetically hand-labeled. The training data always
excluded the one utterance that was currently being tested, i.e., each utterance was
tested based on training on the other 19 utterances. The same method of training was
employed for Test II and Test III, which used phonetically machine-labeled utterances (from
another KS in HEARSAY II), with vocabularies of 275 words and 970 words, respectively.


                                        Test I   Test II   Test III

Accuracy:
  % correct words hypothesized:         81%      60%       54%
  Average number of words
  rated better than correct word:       1.0      6.4       9.5

Activity:
  Average number of words
  generated per utterance word:         25       61        167
  Average number of words
  hypothesized per utterance word:      3.6      13.4      20.9

Training Data Size,
  # of utterances:                      19       28        28

Testing Data Size,
  # of utterances:                      19       12        12

Word Vocabulary Size:                   275      275       965

Figure 8: Results.

In the accuracy statistics given above the number of words rated better than the
correct word includes one half of the number of words with the same rating as the
correct word. Words with the same rating will normally be those supported by the same
syltypes. The errors have been analyzed only for Test I. Often (62% of the time) a
correct word which was not hypothesized was in fact generated, i.e., supporting syltypes
were hypothesized and the word was considered by the module for hypothesization.
However, the rating of the word (based on its supporting syltypes) was lower than
enough other words in the same region to keep it from being hypothesized. (A word is
considered to be in the same region as the correct word if its middle third overlaps with
the correct word.) If another KS deletes some of these competing words or lowers their
ratings, the correct word will be hypothesized. Thus there is a tradeoff between the
percent of correct words hypothesized and the number of words rated better than the
correct hypothesis.
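
The "rated better" statistic, with ties counted as one half, can be made concrete with a short Python sketch. The function name and the ratings below are hypothetical; only the counting rule comes from the text.

```python
def words_rated_better(ratings, correct_word):
    """Count words rated above the correct word, plus half of the words tied with it."""
    target = ratings[correct_word]
    better = sum(1 for rating in ratings.values() if rating > target)
    tied = sum(
        1
        for word, rating in ratings.items()
        if rating == target and word != correct_word
    )
    return better + tied / 2


# Hypothetical ratings for competing word hypotheses in one region.
region = {"ALL": 80, "ARE": 70, "AND": 70, "ANY": 70, "ON": 50}
score = words_rated_better(region, "ARE")
```

Here one word is rated higher than ARE and two words tie with it, so the statistic is 1 + 2/2 = 2.0.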

There are two other causes for POMOW not hypothesizing the correct word. In
15% of the cases the error was caused by insufficient or misleading state transition
statistics (to be explained in the next section). 23% of the errors were caused by the
phones in the utterance differing too greatly from the phoneme pronunciations found in
the word-phoneme dictionary (discussed in the next section).

VI. PROBLEMS TO INVESTIGATE

The results show that when using good phonetic input¹ POMOW currently is able to
reduce HSII's search space of possible words for 81% of the words in the utterance from
275 words to 3.6 words on the average and to rate the correct word so that it is in the
top two ranks (on the average). For the machine-labeled phones,
POMOW is able to reduce HSII's search space of possible words for 60% (54%) of the
utterance from 275 (965) words to 13.4 (20.9) words on the average and to rate the
correct word so that it is in the top seven (ten) ranks (on the average). Of course the
tradeoff mentioned above permits varying these numbers.

The errors found in Test I demonstrate the problems that are encountered when
using machine hypothesized phones. The first problem is one of not finding the right
syltype when the correct sequence of phones is given. Some of these errors are caused
by insufficient state transition statistics and can obviously be reduced by training on
more utterances. As with all statistical methods, we must be careful to represent the
task domain in the training subset. An example of such an error is missing the syltype
"LUP" because there had been no example in the training utterances that the phone W
preceding a UVOW!LEFT state should indicate a transition to a LIQ!LEFT state.

The same problem comes from misleading state transition statistics. Whenever a
phone must indicate a transition to more than one state from the same state, the correct
path through the syltype state network must share the total probability of a syltype
occurring with other paths. This problem is common to statistical methods. We are
guaranteed to be right the majority of the time but not all of the time. The method will
be right all of the time only if there is a mapping of (phone, last-syltype state) pairs onto
syltype states. We are just beginning to use a method which minimizes this type of error
by automatically defining the best phoneme classes.

The second problem is not finding the right syltype when the sequence of phones
for a word differs too greatly from the pronunciations found in the word-phoneme
dictionary. We can expect this problem because of the nature of speech. The question is
how to best handle it. To some extent we can add alternate pronunciations to the
word-phoneme dictionary, but we don't want to enter all possible phoneme sequences for a
word that correspond to all possible phone sequences that the phone hypothesizor might
generate for the word.

¹ Note that "good" here means a good phonetic transcription of what was actually spoken
and not an idealized (dictionary) spelling; thus it does reflect "real speech".

A solution to this second problem has taken the form of defining a metric to match
syltypes of words to hypothesized syltypes. For example, if the syltype "PAN" is
hypothesized there is a significant probability that the syltype "PLAN" occurred. The
metric has not been tested yet - in fact it may be doing more harm than good at the
moment.

There are many remaining areas to investigate. Will stress information permit
POMOW to avoid working in regions where it will probably fail? Can we do better by
deriving the syltypes from a level of representation lower than the phones? How much
training is needed to obtain reliable statistics for our task domain? How much will
inter-speaker variability hurt? These and other questions will be investigated as we continue
our research.
Introduction


Because of the difficulties involved in implementing a speech-understanding
system, recognition of the words of an utterance is a process which must incorporate
knowledge from many different sources, including phonology, prosodics, syntax,
semantics, and pragmatics [Newell et al., 1973]. The Hearsay-II speech understanding
system (HSII) uses a multi-level organization with many diverse, cooperating sources
of knowledge (KSs) in an asynchronous, parallel-processing environment. An
introduction to this organization is provided in an abstract from Erman and Lesser
(1975):

The  hypothesize-and-test  paradigm  is  used  as  the  basis  for  cooperation 
among  many  diverse  and  independent  knowledge  sources  (KSs).  The  KSs  are 
assumed  individually  to  be  errorful  and  incomplete.  A uniform  and  integrated 
multi-level  structure,  the  blackboard,  holds  the  current  state  of  the  system. 
Knowledge sources cooperate by creating, accessing, and modifying elements in
the blackboard. The activation of a KS is data-driven, based on the occurrence
of patterns in the blackboard which match templates specified by the
knowledge  source. 

Each  level  in  the  blackboard  specifies  a different  representation  of  the 
problem  space;  the  sequence  of  levels  forms  a loose  hierarchy  in  which  the 
elements  at  each  level  can  approximately  be  described  as  abstractions  of 
elements at the next lower level. This decomposition can be thought of as an a
priori framework of a plan for solving the problem; each level is a generic
stage in the plan. The elements at each level in the blackboard are hypotheses
about  some  aspect  of  that  level.  The  internal  structure  of  an  hypothesis 
consists  of  a fixed  set  of  attributes;  this  set  is  the  same  for  hypotheses  at  all 
levels  of  representation  in  the  blackboard.  These  attributes  are  selected  to 
serve  as  mechanisms  for  implementing  the  data-directed  hypothesize-and-test 


1 This  research  was  supported  by  the  Defense  Advanced  Research  Projects  Agency 
under  contract  no.  F44620-73-C-0074  and  monitored  by  the  Air  Force  Office  of 
Scientific  Research. 
paradigm and for efficient goal-directed scheduling of KSs. Knowledge sources
may  create  networks  of  structural  relationships  among  hypotheses.  These 
relationships,  which  are  explicit  in  the  blackboard,  serve  to  represent 
inferences  and  deductions  made  by  the  KSs  about  the  hypotheses;  they  also 
allow  competing  and  overlapping  partial  solutions  to  be  handled  in  an 
integrated  manner. 

The  general  "word  verification  problem"  is  the  following:  Given  (1)  a hypothesis 
that  a particular  word  was  spoken  (including  some  indication  of  where  in  the  utterance 
it  occurred)  and  (2)  a phonetic  transcription  of  the  utterance  (which  is  presumed 
errorful),  determine  (a)  how  likely  it  is,  based  on  the  phonetic  evidence,  that  the  word 
was  indeed  spoken  and  (b)  where  the  word’s  boundaries  lie  (in  the  time  domain). 
Other  features  can  be  included  in  the  problem  specification,  e.g.,  hypotheses  about 
words  which  are  adjacent  to  the  hypothesized  word. 

In HSII, this problem is attacked with two knowledge-source modules. The first
(WOMQS) respells the hypothesized word into expected "surface-phonemic"
("surnemic") spellings, using a dictionary containing spellings for all words in the
vocabulary. The second (POSSE) matches these surnemic elements with phones
hypothesized from the acoustic input. This paper describes the general issues and
strategies of word verification in the HSII system.

Word Hypotheses in HSII

In HSII, word hypotheses are represented as blackboard elements at the Lexical
level. These words may be hypothesized from the phonetic description of the
utterance (bottom-up) [Smith, 1976] or from syntactic and semantic predictions based
upon a partially recognized utterance (top-down) [Hayes-Roth and Mostow, 1976].
Associated with each hypothesis are times (begin, end, and duration) which are used to
specify the time region of the utterance in which the word is predicted to appear.
Each time specification also has a range which is a measure of the uncertainty
associated with the time¹. Special ranges may also be used to indicate: (a) no time
information is available for the hypothesis, (b) the time specified is a lower bound for
the predicted time, and (c) the time specified is an upper bound for the predicted time.
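
A time with its range can be read as a symmetric window, as in the paper's footnoted example of a begin-time of 38 with a range of 3. The small Python sketch below shows only that arithmetic; the function name is hypothetical, and the special ranges (a) to (c) are not modeled because the text does not say how they are encoded.

```python
def time_window(time, time_range):
    """Return the (earliest, latest) times, in centiseconds, implied by a time and its range."""
    return (time - time_range, time + time_range)


# Begin-time 38 with range 3: the word is predicted to begin between 35 and 41.
window = time_window(38, 3)
```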

The connections established between elements at various levels in the
blackboard provide implicit structural information. If two or more elements are
connected in sequence to a higher-level element, they are considered to be
time-contiguous (structurally adjacent) hypotheses. Elements supporting an option node in
the blackboard are considered to cover approximately the same time region of the
utterance (competing hypotheses).

¹ E.g., a begin-time of 38 with a range of 3 indicates that the word is predicted to
begin between 35 and 41 centiseconds after the beginning of the utterance.

In  order  to  accomplish  word  verification,  a dictionary  of  expected  word 
pronunciations  is  used.  In  addition,  rules  are  needed  for  handling  phonological 
transformations, coarticulation and other contextual effects on pronunciations, word
boundary  ambiguities,  and  multiple  matching  paths  when  multiple  competing  plausible 
phonetic  hypotheses  exist  in  the  phonetic  description  of  the  utterance.  The  time  and 
implicit  structural  information  (blackboard  connections)  are  also  used  in  verification. 

Design  of  the  Word  Verifier 

In the HSII system, the overall goal may be seen as one of building a consistent
representation of the spoken utterance at each of several levels of representation.
The inputs to the word verifier consist of such representations at the word and
phonetic levels. In order to bridge these levels, the word verifier creates a
representation of the word at an intermediate level (called surface-phonemic) and
attempts to map that representation to the phonetic transcription.¹ The surnemic level
contains dictionary phonetic representations of hypothesized words. Modifiers, such
as syllable and word boundary markers, may also be present at this level.

The  use  of  the  surnemic  level  is  largely  extra-theoretical  and  cannot  be  easily 
characterized  in  terms  of  general  phonological  theory  [Shockey  and  Erman,  1974]. 
However, its use in HSII, as described below, provides a data representation which
allows  efficient  resolution  of  the  ambiguities  and  problems  of  word  verification. 

The process of matching lexical hypotheses to phonetic hypotheses thus
becomes one of matching surface-phonemic hypotheses to phones. The approach
taken is an incremental one: a small context of several surnemic and phonetic
hypotheses is examined and a determination of plausible mappings within that context
is made. These results are recorded in the blackboard by creating appropriate links,
with ratings, between hypotheses (as described below). The time areas in which these
matches are made are thus small compared to the time areas covered by entire word

¹ In actuality, an additional level exists in the system between the word and surnemic
level: the syllable level. This is used primarily by the bottom-up word hypothesizer
[Smith,  1976].  The  word  verifier  also  uses  this  level;  when  respelling  words  into  the 
surnemic  level,  they  are  first  spelled  at  the  syllable  level  and  then  at  the  surnemic. 
For  purposes  of  simplicity  of  presentation,  this  paper  will  not  consider  the  syllable 
level. 
hypothesis. One benefit of this finer granularity of activity is to allow the HSII
focus-of-attention mechanism [Hayes-Roth & Lesser, 1976] to schedule selectively the
operation of the KSs in different areas of the utterance. Also, the incremental nature
of these surneme-phone matches requires less reevaluation of matches when changes
in the phonetic data structure or changes in hypothesized word adjacency occur.

Time  Information  at  the  Surnemic  Level 

The  problem  of  limiting  the  search  space  for  matching  dictionary  representations 
of  words  with  phonetic  hypotheses  is  resolved  by  the  use  of  time  and  structural 
information  associated  with  each  surneme  hypothesis  (and  which  is  derived  from  their 
associated word hypotheses). For words hypothesized from the phonetic level, a good
estimate of the time of the vowel phone for each syllabic nucleus is provided by the
bottom-up word hypothesizer [Smith, 1976]. This information - the begin and
end-times - is carried by the surnemic vowel hypothesis when the word is respelled into
its surnemic representation. Only that time area of the phonetic structure
corresponding to those begin and end-times need be searched to find a possible match
for the surnemic vowel. Finding matches for the other (structurally adjacent) surnemes
of the hypothesized words may then proceed outward incrementally from the vowel.

Structural Adjacency (Time-Contiguous) Information at the Surnemic Level

Words hypothesized from higher-level sources of knowledge (syntax and
semantics)  have  limited  time  information  available.  At  best,  the  begin  or  end-time  of 
the  word  is  known,  based  upon  the  end  or  begin  time,  respectively,  of  the  supported 
word  which  predicts  the  new  word  hypothesis. 

Since the predictor word is one which has already received support from the
phonetic level, the search for matches for surnemes of the predicted word can use the
structural (contextual) information explicit in that support. Just as the word verifier
expands outward from the vowel surnemes for a word hypothesized from below (using
structural adjacency), it also expands outward in the direction of prediction from the
"end" (first or last) surneme of the predictor word using the predicted structural
adjacency across the word boundary.

Structural adjacency implies time contiguity. Therefore, in extending outward
from an existing match between a phone and a surneme, we need only examine a
limited context: those structurally adjacent surnemes and those time-adjacent phones
in the direction of extension.
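
One extension step can be sketched as below. The function name `extension_context`, the tuple layout, and the use of exact time equality for "time adjacent" are assumptions for illustration; the text does not say how much tolerance the actual system allows.

```python
def extension_context(surnemes, phones, s_index, phone, direction=+1):
    """Return (adjacent surneme, time-adjacent phones) for one extension step.

    surnemes:  ordered list of structurally adjacent surneme labels.
    phones:    list of (label, begin, end) phone hypotheses.
    phone:     the (label, begin, end) phone already matched to surnemes[s_index].
    direction: +1 to extend rightward, -1 to extend leftward.
    """
    next_index = s_index + direction
    if not 0 <= next_index < len(surnemes):
        return None, []
    _, begin, end = phone
    if direction > 0:
        adjacent = [p for p in phones if p[1] == end]    # begins where the match ends
    else:
        adjacent = [p for p in phones if p[2] == begin]  # ends where the match begins
    return surnemes[next_index], adjacent

# Hypothetical phone hypotheses; two of them compete for the same time span.
phones = [("V", 20, 25), ("UH", 25, 31), ("IH", 25, 31), ("S", 31, 40)]
surneme, neighbours = extension_context(["V", "AX", "S"], phones, 0, ("V", 20, 25))
```

Only one surneme and a handful of phones are examined per step, regardless of the length of the utterance.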

Implications of Matches Found

As the matching process proceeds, surneme-phone similarity scores are
retrieved from a similarity matrix generated (in part) from training data. If the
similarity for matching (linking) is higher than the threshold for that type of match,
then the match is made, giving support to the surneme. The similarity score becomes
the implication (or confidence rating) for the link. The similarity score is
context-independent. However, actual implications given to links may be modified by
phonological and surnemic-contextual rules, as described below.

Implications for matches may be either positive or negative. The rating for a
word is determined by the implications given to all of the matches found for a
sequence of surnemes for the word. When the word-verifier is unable to find a
plausible match within the context specified, but rather finds contrary indications,
negative implication links are made; these serve to lower the rating of the word
hypothesis.
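
The thresholded lookup can be sketched as follows. The dictionary layout and the threshold values are assumptions; the three similarity scores are the context-free implications shown in Figure 2.

```python
def link_implication(similarity, surneme, phone, threshold):
    """Return the implication for a surneme-phone link, or None if no link is made."""
    score = similarity.get((surneme, phone))
    if score is not None and score > threshold:
        return score  # the similarity score becomes the implication of the link
    return None

# Context-independent similarity scores (values as in Figure 2).
similarity = {("DH", "DH"): 100, ("EH", "ER"): 35, ("R", "ER"): 70}

made = link_implication(similarity, "R", "ER", threshold=30)       # threshold assumed
rejected = link_implication(similarity, "EH", "ER", threshold=50)  # threshold assumed
```

Negative implication links are not modeled here; they would need a separate test for contrary evidence, which the text does not specify.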

Multiple  Matches  of  Phones  to  Surnemes 

Since the phonetic representation of an utterance involves overlapping and
multiple phonetic hypotheses, the phonetic data structure is really a graph rather than
a string of phones [..., 1975]. Therefore, multiple time paths through the phonetic
hypotheses may provide alternative sets of matches for the sequence of surnemes for
a given word (or sequences of phones matched to a surneme). It is not known which
alternative match will provide the best overall rating for a word until each path has
been matched. Mechanisms exist within the program for duplication of portions of the
data structure from the surnemic level to the word level. Several options may exist
for matches, and the rating given a word will depend upon the best match found
among the options. Some of the mechanisms used for duplication, and an indication of
the decisions made, are shown in the example below.
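
Choosing among alternative paths can be sketched as below. The path names, the implication values, and the use of the arithmetic mean as the rating function are assumptions for illustration only.

```python
from statistics import mean

def best_path(paths, rate=mean):
    """Return (name, implications) of the alternative with the highest rating."""
    return max(paths.items(), key=lambda item: rate(item[1]))

# Two hypothetical paths through the phone graph for the same surneme sequence.
paths = {"via /Z1/": [80, 60], "via /S/": [80, 75]}
name, implications = best_path(paths)
```

Every path has to be rated before the maximum is taken, which is why the verifier keeps all options in the data structure until matching is complete.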

Contextual  Modifications  to  Implications 

As stated above, the similarity score given any surneme-phone match is
context-independent. Coarticulation and other contextual effects, however, are
reflected in the (acoustic-)phonetic transcription of the utterance produced by the
lower-level knowledge sources. These effects in changing the pronunciation of words
may, in a context-independent matching procedure, severely limit word recognition. In
the word-verifier, two mechanisms exist for handling contextual effects on
pronunciation:

    Since alternative dictionary representations exist, for those words
    which are hypothesized from the phonetic level, the pronunciation which
    most closely resembles the phonetic structure (as determined by the
    bottom-up word hypothesizer) is hypothesized at the surnemic level.


SURNEMIC-CONTEXTUAL RULES

    ( surnemic context ) > /phonetic hypothesis/ (implication)

    surnemic context:    may be explicit surnemes or surnemic classes
    phonetic hypothesis: may be explicit phones or a phonetic feature set
    implication:         rating to be given the match

Examples:

    ( [EH] + [R] ) > /ER/ (90)

        ( A surnemic [EH] + [R] sequence may be recognized
          as the phone /ER/. Match the [EH] or the [R] to
          /ER/ with implication 90 when this context appears. )

    ( FRIC + [DH] ) > / {+FRC} / (85)

        ( Surneme [DH] may not be hypothesized as a
          phone when it appears after a fricative.
          Match the [DH] to the fricated phone. )

    ( VOW + [N] ) > / {+NAS} / (85)

        ( When the surneme [N] appears after
          a vowel, match it to a nasalized
          phone with implication 85. )


Figure 1. Surnemic contextual rules: syntax and examples.


    In addition, phonological and surnemic-contextual rules are applied to
    determine the final implication for matches. The rules are written in a
    format which uses the contextual information available at the time matches
    for a surneme are found. Right- and left-contextual rules are available for
    implication modification. The syntax for, and examples of, these rules are
    given in Figure 1. In applying these rules, acoustic and articulatory features
    of the hypothesized phones and broad class membership of surnemes and
    phones may be used in determining whether a given rule applies. In
    addition, explicit recognition rules may be used for commonly observed

(Diagram levels, top to bottom: WORD LEVEL, SURNEMIC LEVEL, PHONETIC LEVEL.)

Context-free matching:

    [DH] - /DH/    imp=100
    [EH] - /ER/    imp=35
    [R]  - /ER/    imp=70

    "THERE" receives upper validity of 68

Using the surnemic-contextual rule:

    ( [EH] + [R] ) > /ER/ (90)

the following matches are made:

    [DH] - /DH/    imp=100
    [EH] - /ER/    imp=90
    [R]  - /ER/    imp=90

    "THERE" receives upper validity of 93

Figure 2. Contextual rule use.


    actions of lower-level sources of knowledge. An example of the application
    of two contextual rules is given in Figure 2. Using only context-independent
    matching, a rating of 68 would be given the word hypothesis. Application of
    the contextual rules increases that rating to 93.
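
The change from 68 to 93 can be reproduced with a short sketch. The text does not state how link implications are combined into a word rating; the arithmetic mean used here is an assumption that happens to give both figures. The function names and tuple layout are also illustrative.

```python
def apply_sequence_rule(links, first, second, phone, implication):
    """Apply ( [first] + [second] ) > /phone/ (implication) to an ordered list of links.

    links: list of (surneme, phone, implication) tuples in surneme order.
    """
    result = list(links)
    for i in range(len(links) - 1):
        (s1, p1, _), (s2, p2, _) = links[i], links[i + 1]
        if (s1, s2) == (first, second) and p1 == p2 == phone:
            result[i] = (s1, p1, implication)
            result[i + 1] = (s2, p2, implication)
    return result

def word_rating(links):
    """Assumed rating: arithmetic mean of the link implications, rounded."""
    return round(sum(imp for _, _, imp in links) / len(links))

# Context-free implications for "THERE", as in Figure 2.
context_free = [("DH", "DH", 100), ("EH", "ER", 35), ("R", "ER", 70)]
with_rule = apply_sequence_rule(context_free, "EH", "R", "ER", 90)
```

Under this assumption `word_rating(context_free)` is 68 and `word_rating(with_rule)` is 93, matching the ratings quoted for the word hypothesis.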

Resolution of Word Boundary Ambiguities

In connected speech, word boundaries are not clearly apparent in the phonetic
transcription of the utterance. This presents problems in searching for matches for
word representations in the phonetic structure. HSII, using structural adjacency to
extend matches from established surnemic support, reduces this problem. As words
become structurally adjacent (time-contiguous) as the result of higher-level sources of
knowledge, word boundary times can be sharpened. Because of the incremental nature
of the word-verifier operation, this can be accomplished in most cases without
reevaluating matches internal to the word. (The mechanism for doing this matching at
word boundaries is identical to that used intra-word; this is a by-product of the fine
grain of the control structure of the word verifier.) The word boundary times are then
propagated upward through the data structure to the word level to aid those
upper-level knowledge sources in recognition.


Data  Representation  and  Program  Organization 

Knowledge source activation is data-directed. In general, a precondition for a
knowledge source monitors certain contexts within the blackboard, and, if patterns
specified by that precondition occur, the KS is scheduled for activation. Once
activated, the KS may access and change the blackboard, making decisions based upon
the knowledge it has about the levels upon which it operates.
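
Data-directed activation can be sketched with a toy blackboard. The `Blackboard` class, its method names, and the first-in-first-out agenda are assumptions for illustration; the actual HSII scheduler uses a focus-of-attention policy that is not modeled here. The spelling of "US" as [AX][S] is the dictionary representation used in this paper's example.

```python
from collections import deque

class Blackboard:
    def __init__(self):
        self.hypotheses = []         # (level, content) pairs
        self.knowledge_sources = []  # (precondition, action) pairs
        self.agenda = deque()        # scheduled KS activations

    def register(self, precondition, action):
        self.knowledge_sources.append((precondition, action))

    def add(self, level, content):
        """Post a hypothesis and schedule every KS whose precondition matches it."""
        hyp = (level, content)
        self.hypotheses.append(hyp)
        for precondition, action in self.knowledge_sources:
            if precondition(hyp):
                self.agenda.append((action, hyp))

    def run(self):
        """Run scheduled activations until none remain."""
        while self.agenda:
            action, hyp = self.agenda.popleft()
            action(self, hyp)

SPELLINGS = {"US": ["AX", "S"]}

def respell(board, hyp):
    """Respelling KS: post the surnemes of a word hypothesis."""
    for surneme in SPELLINGS[hyp[1]]:
        board.add("surneme", surneme)

board = Blackboard()
board.register(lambda hyp: hyp[0] == "word", respell)
board.add("word", "US")
board.run()
```

The respelling KS never polls the blackboard; it runs only because a new word-level hypothesis satisfied its precondition.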

Word verification in HSII is performed by four separate knowledge source
modules: WOM respells word hypotheses into syllables; MOS respells syllable
hypotheses into dictionary surnemic spellings; TIME links phones to surnemes using
specific time information associated with the hypotheses; SEARCH links phones to
surnemes using previously established surneme-phone matches and the structural
information explicit in the data structure.

Verifying a Word Hypothesis - an Example

The following example illustrates one way in which a word hypothesis might be
verified. In this example (begun in Figure 3), the word "GIVE" has already been
"recognized". The word has received support from the phonetic level, and a high
rating has been assigned to the word hypothesis. The times shown (<10±3/25±3>) are
the begin and end-times and ranges (10 plus or minus 3 csec. and 25 plus or minus 3
csec., respectively) associated with the word hypothesis. The syntax and semantics
module has predicted the word "US" to follow "GIVE". Only its begin-time is
hypothesized: beginning at time 25 (presumably derived from the end-time of the
structurally adjacent "GIVE"). The end-time and both ranges are unknown (shown as
"+" in the figure). The following steps show the verification of the word hypothesis
"US". Each step represents a separately scheduled activation of one of the KSs
described above.

STEP 1: The appearance of a new hypothesis at the word level triggers the
precondition of the respelling KS. When the KS is activated, the word "US" is respelled
into its dictionary representation, [AX][S], at the surnemic level¹ (Figure 4).


¹ Note that, for simplicity, we are ignoring the intermediate process of respelling at
the syllable level.

STEP 2: The matching procedure is activated and uses the predicted structural
adjacency with the word "GIVE" to extend its search for matches outward from the last
surneme of the predictor word ([V]). The context for extending matches includes the
"given surneme" (previously matched surneme) [V], the structurally adjacent surneme
[AX], the "given phone" (previously matched phone) /K/, and the set of time-adjacent
phones {/V/}. The implication of each surneme-phone match is considered, and, in this
case, the [V]<->/V/ match is made (Figure 5).

STEP 3: In the next instantiation (triggered by the creation of the new link), the
context for matching is the given surneme [V], the given phone /V/, the adjacent
surneme [AX], and the set of adjacent phones: {/UH/, /IH/}. Two matches are decided
upon: [AX]<->/UH/ and [AX]<->/IH/. Since both phones have the same begin and
end-times, an option node is created at the phonetic level and supported by both phones.
The surneme [AX] is linked to the option node. The implication given this link is
considered to be the highest implication from the phone options. The word boundary
time between these two words is now known, and is propagated upward to the word
level (Figure 6).
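
The option node of Step 3 can be sketched as below. The tuple layout, the function name `option_node`, and the sample times and implications are hypothetical; only the rule that the link takes the highest implication among the options comes from the text.

```python
def option_node(matches):
    """Group phone matches with identical times under one option node.

    matches: list of (phone, begin, end, implication) tuples.
    Returns (phones, begin, end, implication), where the implication is the
    highest implication among the phone options.
    """
    times = {(begin, end) for _, begin, end, _ in matches}
    if len(times) != 1:
        raise ValueError("phone options must share begin and end-times")
    begin, end = next(iter(times))
    return [m[0] for m in matches], begin, end, max(m[3] for m in matches)

# Hypothetical times (csec.) and implications for the two candidate phones.
node = option_node([("UH", 25, 31, 60), ("IH", 25, 31, 75)])
```

Phones with different end-times, as in Step 4, cannot share a node of this kind; they represent separate paths and require the surneme itself to be duplicated.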

STEP 4: With the next KS instantiation, two matches are possible: [S]<->/Z1/ and
[S]<->/S/. Since these phones have different end-times, they represent possibly
separate matching paths through the phonetic structure. We do not yet know which
path will provide the best overall match; therefore, an option node is created at the
surnemic level and the surneme is duplicated. One of the options of the surneme is
matched to /Z1/, the other to the /S/. We now have some idea of the end-time of the
word "US", and that time is propagated upward to the word level. A range is given
that end-time to reflect the degree of uncertainty about what the final end-time will
be (Figure 7).

At this point, the matching process cannot continue since there is as yet no
structurally adjacent surnemic context to consider. A phonetic representation of the
word has been found (though not yet completed¹). The word boundary ambiguity
between "GIVE" and "US" has been resolved, and reasonable begin and end-times have
been associated with the word "US".


¹ There is still another match to be made: one of the options of [S] to the phone /Z2/.
This will occur when another word is hypothesized structurally adjacent to the right
of "US" and the matching process continues from the link [S]<->/S/.

Preliminary Results

System runs for performance evaluation of word verification were made over
eleven training utterances spoken by a single speaker using a 275-word vocabulary.
Syntax and semantics knowledge sources were not present; all word candidates were
predicted from the phonetic hypotheses.

In  the  eleven  utterances,  708  words  were  hypothesized;  30  of  these  were  the 
"correct"  words  (i.e.,  the  actual  word  spoken  hypothesized  in  the  correct  time  area  of 
the  utterance).  The  results  of  this  very  preliminary  evaluation  of  the  word  verification 
process  are  summarized  in  Figure  8. 


Percent of total hypothesized words verified                               58%

Percent of correct words verified                                          97%

Percent of time correct word was highest rated in its time area            48%

Percent of time correct word was in top 5 rated words in its time area     82%

Average number of incorrect words rated higher than correct word
in its time area                                                           2.3

Figure 8. Preliminary results of the word verifier.


Improved performance of the word verification process is expected both in
terms of decreasing the total percentage of incorrect words verified, and also in terms
of increasing the ratings of correct words. Evaluation of performance has only begun.
Better training procedures for the phone-surneme similarity matrix will be defined,
more efficient matching thresholds for increased discrimination will be determined, and
surnemic-contextual rule determination may be automated during system training.

Research is continuing. We feel that we have a viable design for word
verification which will allow us to pursue major issues in depth.

INTRODUCTION

This paper is a description of the linguistically-related principles behind
PSYN, the current phonetics module in Hearsay II. It serves as one of many interactive
knowledge sources, the main purpose of which is to make easier the mapping of
relatively broad phonetic spellings from a dictionary to a (faulty) narrow phonetic
transcription made by an automatic segmenter and labeler. PSYN attempts to produce
the closest thing to a phonemic transcription that it can without higher-level knowledge.
Although each knowledge source in HSII has the potential to interact with any other
module, we will make the following explanations as if it were a bottom-up system, for
ease of understanding.

Input to PSYN: One or more labels are assigned to each acoustic segment in the speech
input as determined by an automatic procedure. Each label is assigned a rating
indicating the degree of success of the pattern match which produced it. These, plus
amplitude maximum and minimum locations, are the information which is available to
PSYN. We hope to include reliable voicing and frication detectors, which will give us
additional cues, but at present we work only on labels, durations, and max-min
information.

Changes Performed By PSYN: There are two basic kinds of transformations which we
apply to these input data: 1) mapping onto a feature space and 2) reassigning
boundaries and labels. Each input segment may have from one to n labels. But the
information given us by these labels is often difficult to interpret. While each label
may supply important information, it is sometimes unclear what a particular
configuration of them means. For example, suppose we get the labels UW, AA, L, and M
for a single segment. We then know that the segment involved has a strong
low-frequency component which may make it look like a consonant, but that some
high-frequency information is in evidence. We can assume it to be a back vowel if it is a
vowel, but as a whole, the phonetic identity of the segment is difficult to evaluate. We
try to ameliorate this situation by decomposing each of the labels into features, then
trying to make a better guess at what segmental quality is actually present by which
features are most strongly represented. We use a set of 13 articulatory and acoustic
features, modeled on past feature systems, but somewhat different. We are able to
quantize the strength of each feature in two ways: a table indicates how strongly
each feature is present in each element in the ideal case; and the rating on the
segment supplies us with a weighting factor. Sounds are represented after this
transformation, then, as a weighted feature vector with associated max-min
information. Since they are thus decomposed, it is easy to write the phonetic rules
described below with reference to classes. The vectors are also available to the
phonological component, so phonological rules can be expressed similarly. Note that
the features we use are extracted completely from labels at present. Therefore, this
scheme bears little resemblance to those which try to detect features directly from
some representation of the speech signal.
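
The mapping from rated labels to a weighted feature vector can be sketched as follows. The ideal feature strengths in `FEATURES`, the ratings, and the choice to normalize by the sum of the ratings are assumptions for illustration; the text says only that a table gives the ideal strengths and that the rating supplies a weighting factor.

```python
def weighted_feature_vector(rated_labels, feature_table):
    """Combine the rated labels of one segment into a single feature vector.

    rated_labels:  {label: rating} for the segment.
    feature_table: {label: {feature: strength in the ideal case}}.
    Each feature is the rating-weighted average of its ideal strengths.
    """
    total = sum(rated_labels.values())
    features = {}
    for label, rating in rated_labels.items():
        for feature, strength in feature_table[label].items():
            features[feature] = features.get(feature, 0.0) + rating * strength / total
    return features

# Hypothetical ideal feature strengths (0 to 1) for two labels.
FEATURES = {
    "UW": {"VOC": 1.0, "BK": 1.0, "LOF": 0.5},
    "M":  {"VOC": 0.0, "NAS": 1.0, "LOF": 1.0},
}
vector = weighted_feature_vector({"UW": 3, "M": 1}, FEATURES)
```

Here the shared low-frequency feature is reinforced by both labels, while the vocalic and nasal features reflect how strongly each label was rated.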

All of the transformations described in 2) are predicated on sets of features, rather
than labels; feature vectors are mapped back into labels before they are output.
Reassigning boundaries and labels is the transformation usually thought of as
acoustic-phonetic rules. To do this, we use information derived from sequences of labels and
from single labels in some cases. There are three kinds of transformations which take
place here: sequential, splitting, and positional relabeling.

Sequential transformations involve combining adjacent fine phonetic segments to
make larger segments. A well-known example is the combination of silence + optional
burst + aspiration into a voiceless stop. If we receive a sequence of labels:

    time       segment
    40-45      - (silence)
    45-51      S

we will create the hypothesis that a voiceless stop, probably /T/, exists in the time
span between centiseconds 40 and 51. Of course, the original hypothesis put forth by
the segmenter and labeler is still present, in case our combination of these segments is
in error. Duration of the elements involved is considered in this rule, as in most. A
less commonly discussed but very commonly found articulatory event is the splitting of
a voiced stop into two portions as a result of back pressure behind the closure. The
voiced stop begins voiced and therefore is assigned a label of, e.g., M, D, or V. Then
the voicing drops off, which produces a silence. This is interpreted as a new segment
by the automatic devices. Therefore, there is a rule in PSYN which combines
sequences such as:

    time       segment
    40-45      M,B
    45-50      - (silence)

into a voiced stop and assigns it the label most congruent with its place-of-articulation
features. Other combining rules in PSYN attempt to identify short vowels "on the
shoulders of" long vowels as transitions and to combine the transitional silences which
occasionally occur with fricatives with their accompanying /S/, /F/, etc. We have
worked on diphthong rules, but have met with little success.
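
The silence-plus-burst combination can be sketched on labels alone. PSYN itself tests feature vectors, not labels; the `burst_like` set, the `max_duration` limit, and the output name are assumptions made so that the example stays small.

```python
def combine_voiceless_stops(segments, max_duration=15):
    """Propose voiceless-stop hypotheses for silence followed by a burst or aspiration.

    segments: list of (begin, end, label) in time order; "-" marks silence.
    Returns the new hypotheses; the input segments are left unchanged, so the
    original segmentation remains available if the combination is in error.
    """
    burst_like = {"S"}  # assumed labels that can stand for burst or aspiration
    proposals = []
    for (b1, e1, l1), (b2, e2, l2) in zip(segments, segments[1:]):
        contiguous = e1 == b2
        short_enough = e2 - b1 <= max_duration  # assumed duration test, in csec.
        if l1 == "-" and l2 in burst_like and contiguous and short_enough:
            proposals.append((b1, e2, "VOICELESS_STOP"))
    return proposals

# The sequence from the example above: silence 40-45, then S 45-51.
segments = [(40, 45, "-"), (45, 51, "S")]
stops = combine_voiceless_stops(segments)
```

The proposal spans centiseconds 40 to 51, the union of the two input segments.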

Splitting transformations make a single segment into two (or theoretically more). The
only rule of this type which we have at present takes a long, highly velarized, vocalic
segment and splits it into vowel + L, with appropriate durations.

The third type of transformation is positional relabeling, which may be considered
equivalent in part to allophony rules. For example, if we find a highly velarized vowel
which a) has no local maximum and b) is next to another vowel, we hypothesize it could
be W or L; if it meets the above requirements but is retroflex instead of velarized, we
call it /R/.

In a very broad sense, then, PSYN accepts a set of rated labels and max-min
information, maps the labels into a feature space, makes new hypotheses based on sets
of features, and outputs a new set of labels congruent with those hypotheses. While
our rate of correct output labels remains less than 40%, we find that, given the best
path through the various alternative hypotheses, we reduce the number of extra
segments generated by the automatic segmenter by about 50%. When 11 larger
classes (high vowels, low vowels, fricatives, nasals, etc.) are considered, the
percentage correct averages around 75.


IMPLEMENTATION 

The bottom-up processing of Hearsay II begins with an acoustic segmentation and
labeling. On the basis of this acoustic information, the phone synthesizing module
(PSYN) provides a phonetic transcription of the speech signal. Other sources of
knowledge then use this transcription to hypothesize syllables and words.

PSYN is composed of two major sources of information. The first source is an acoustic
representation of the speech signal. Feature vectors are used to describe the labels
which are assigned to the acoustic segments. Amplitude information is also provided
in the form of MXNs which are hypothesized at the segmental level. The second
source of information present in PSYN is a set of phonetic rules. Rules written in a
similar manner to those of traditional phonology are read by PSYN and are applied to
the segmental hypotheses. Phonetic hypotheses are made as indicated by these rules.

Low-Level Combining: There are two Hearsay II knowledge sources within the PSYN
module: CSEG and PSYN. A preliminary pass is made on the segments by the low-level
combiner called CSEG. CSEG is invoked by a set of unused segmental hypotheses
which have a left and right context. The knowledge source then acts on each segment
individually. The time boundaries, acoustic labels, and a weighted feature vector
formed from the labels of the segment are accessed from the data base and lexicons.
A feature comparison is made between the segment and each context. If the two
segments differ in any feature by more than a given threshold, a low-level combination
is not performed. In addition, two segments are not combined if both have a local
maximum in amplitude. If the features of any two or three adjacent segments are
close enough to pass both comparison tests, a combining operation is performed.
Labels for the new segment are obtained by determining the best matches between
the averaged features of the segments and the labels in the segmental feature file.
The best labels are combined to identify the new segment. Ratings are based on the
distance from each chosen label to the averaged features. The new feature vector is
formed from the averaged features of the individual segments which form the new
segment.
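
The two comparison tests and the averaging step can be sketched for a pair of segments. The dictionary layout, the feature values, and the threshold of 0.2 are assumptions; the relabeling of the merged segment against the segmental feature file is omitted.

```python
def can_combine(a, b, threshold):
    """Two adjacent segments may be combined if no feature differs by more than
    `threshold` and they do not both contain a local amplitude maximum."""
    names = set(a["features"]) | set(b["features"])
    close = all(
        abs(a["features"].get(f, 0.0) - b["features"].get(f, 0.0)) <= threshold
        for f in names
    )
    return close and not (a["has_max"] and b["has_max"])

def combine(a, b):
    """Merge two segments, averaging their feature vectors."""
    names = set(a["features"]) | set(b["features"])
    return {
        "begin": a["begin"],
        "end": b["end"],
        "features": {f: (a["features"].get(f, 0.0) + b["features"].get(f, 0.0)) / 2
                     for f in names},
        "has_max": a["has_max"] or b["has_max"],
    }

# Hypothetical adjacent segments with similar features; only one has a maximum.
s1 = {"begin": 10, "end": 14, "features": {"VOC": 0.9, "FRNT": 0.7}, "has_max": True}
s2 = {"begin": 14, "end": 19, "features": {"VOC": 0.8, "FRNT": 0.6}, "has_max": False}
merged = combine(s1, s2) if can_combine(s1, s2, threshold=0.2) else None
```

The amplitude test keeps two separate syllable peaks from being merged even when their feature vectors are nearly identical.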

Phone Synthesis: After the low-level combining operation is attempted, the phonetic
processing of the segment is performed by the other phonetic knowledge source,
PSYN, which also requires a segment with a left and right context. The feature vectors
of the segments involved are used to represent the acoustic information. A series of
phonetic rules is used to test the segment and its contexts. If the required conditions
are met for any of these rules, phones of the appropriate class are proposed. If none
of the phonetic rules apply, a relabeling process based on the set of all phones is
performed. The set of phones proposed in each case is chosen by comparing the
weighted feature vector of the segment with the feature vector of the labels in the
desired phone class. Those phones found to be the closest are selected. The best
match is determined by a Euclidean-type distance measure between the two vectors.
A rating routine assigns an implication to all possible links, based on the Euclidean
distance measure and a multiplicative factor assigned to each link within the rule. After
all possible phones for a segmental triple have been proposed and rated by the
individual actions, the best (highest rated) phones are hypothesized.
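
Selecting the closest phones in a class can be sketched with a plain Euclidean distance. The ideal vectors in `TABLE`, the segment vector, and the number of phones kept are assumptions; the rating formula is not given in the text and is therefore left out.

```python
import math

def distance(u, v):
    """Euclidean distance between two feature vectors stored as dicts."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))

def closest_phones(segment, phone_class, table, n=2):
    """Return the n labels of `phone_class` closest to the segment's feature vector."""
    return sorted(phone_class, key=lambda label: distance(segment, table[label]))[:n]

# Hypothetical ideal feature vectors for the nasal class.
TABLE = {
    "M":  {"NAS": 1.0, "FRNT": 1.0},
    "N":  {"NAS": 1.0, "CENT": 1.0},
    "NX": {"NAS": 1.0, "BK": 1.0},
}
segment = {"NAS": 0.9, "FRNT": 0.8, "CENT": 0.3}
best = closest_phones(segment, ["M", "N", "NX"], TABLE)
```

Restricting the comparison to the class named by the rule is what lets a rule such as NASAL narrow the relabeling to a few candidates.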


PHONETIC  RULES 

The phonetic rules used by PSYN are written in production system format. The
condition part of the rule is defined by a list of feature requirements which must all be
satisfied for the rule to apply. Feature tests for each segment are expressed in terms
of a feature name and a threshold level. Both the feature names and the threshold
levels are predeclared by the module.

The  action  portion  of  the  rule  is  defined  by  a procedure  call.  Each  rule  action  defines 
a subset  of  phones  for  relabeling,  performs  the  relabeling  based  on  feature  vectors, 
and  rates  the  resulting  labels. 
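
The condition part of a rule can be sketched as a list of tests, each combining a threshold prefix and a feature name. The threshold values, the reading of each test as "feature strength is at least the named level", and the use of "~" for negation are all assumptions; the test names NUNAS and LOMXN are those of the NASAL rule.

```python
# Assumed, user-set threshold levels for the two-letter prefixes.
THRESHOLDS = {"NU": 0.1, "SL": 0.3, "LO": 0.4, "MD": 0.5, "HI": 0.8}

def holds(test, features):
    """Evaluate one feature test such as "HIVOC" or "~MDMXN" against a feature vector."""
    negated = test.startswith("~")
    name = test.lstrip("~")
    level, feature = name[:2], name[2:]
    result = features.get(feature, 0.0) >= THRESHOLDS[level]
    return not result if negated else result

def rule_applies(condition, features):
    """A condition is a list of feature tests, all of which must be satisfied."""
    return all(holds(t, features) for t in condition)

nasal_rule = ["NUNAS", "LOMXN"]
segment = {"NAS": 0.6, "MXN": 0.45, "VOC": 0.2}  # hypothetical feature strengths
applies = rule_applies(nasal_rule, segment)
```

When a condition is satisfied, the rule's action would be called to relabel the segment within the rule's phone subset.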

There are three categories of phonetic rules implemented in PSYN. The simplest rules
are relabeling rules. A single segment is matched with the phone labels in a subset
specified by the rule. The phones hypothesized will span the same time interval as
the segment. Context links may be formed if relevant. A more complex rule type is a
combining rule. Two or three segments are combined to form a single phone which
spans the time interval of all segments involved. Occasionally PSYN may determine
that a segment is missing at the lower level. The third type of rule splits such a
segment into two phones.


Explanation of Features: High, mid, and low (HI,MID,LO) refer to degree of closure. It
is greatest for stops, least for open vowels. Front, central, and back (FRNT,CENT,BK)
refer to point of articulation or, for vowels and glides, greatest point of approximation.
Rounded, retroflexed, velarized, nasal, voiced, fricative, vocalic, consonantal, and
diphthongal (RND,RTR,VEL,NAS,VCD,FRC,VOC,CON,DIPH) are used in their usual senses.
Hi-amp (HIA) indicates a segment which is expected to have relatively high energy, as
for example [s] as opposed to [f]. Null (NUL) indicates very low-energy segments. Low
frequency (LOF) is a feature shared by all segments with a concentration of energy at
the bottom of the spectrum, e.g. nasals, l, and uw. "MXN" describes an actual amplitude
contour as discovered by lower-level processes. Prefixes such as "NU", "HI", etc. indicate
threshold levels, which are set by the user.

Phonetic Rules in PSYN

(Rule diagrams. Each rule is listed with its legible feature tests and output class;
"~" marks a negated test. Bracket structure is shown only where it is recoverable.)

FLAP:
    [LOMXN, MINI, ~NUNAS] -> FLAP / [HIMXN, HIVOC] _ [HIMXN, HIVOC]

NASAL:
    [NUNAS, LOMXN] -> NASAL

L:
    tests: NUVEL, STMXN, HILOF; ~MDRND, NUVEL, ~MDMXN, MDVOC, NLNG
    output: L

VOWEL:
    tests: HIMXN, HIVOC; NUVOC, ~VWMXN; NUCON, ~MDVOC, ~VWMXN; ~MDVOC, ~HICON
    contexts: MDNUL | MDFRC
    outputs: VOWEL, VOWEL2

CONSONANT:
    tests: ~MDMXN, HICON
    output: CONSONANT

SYLLABIC RESONANTS:
    [HIVEL, NUCON, HIMXN] -> EL
    [HIRTR, SLCON, HIMXN] -> ER
    [HINAS, NUCON, HIMXN] -> EN | EM
    [MDVEL | MDRTR]       -> SYL_RES

W:
    tests: MDRND, NUVEL, ~MDMXN, MDVOC, NLNG
    context: [HIMXN, HIVOC]




COMBINING RULES

VOICELESS STOP:
    tests: SLNUL; MINI, NUFRC, MDCON; NLNG, NUFRC, MDCON; NLNG, MDFRC, MDCON;
           LOFRC, NUVOC, LOCON, MINI
    output: VOICELESS_STOP

VOICED STOP:
    tests: MDHI, HIHI, NUVCD, NUFRC, MDCON, SLNUL, HICON, SLLOF, LOVCD,
           ~MDNAS, MINI, MDNUL
    output: VOICED_STOP

FRICATIVES:
    tests: MINI, SLNUL, SHRT, HIFRC, MDCON, ~MINI
    output: FRICATIVE

    [HIFRC] + [HIFRC] | [HIFRC] + [HIFRC] + [HIFRC] -> FRIC

VOWEL:
    tests: ~NUVEL, HIVOC, HIMXN, NLNG, ~MDMXN, MDVOC, NUHI
    output: EX_VOWEL

AW:
    tests: SLFRNT, SLLO, ~HIHI, MDVEL, LODIPH, HIVOC, SLBK

SPLITTING RULES

L:
    tests: VWMXN, ~MDVEL, MDVEL, ~MDCON
    output: VOW + L


PERFORMANCE  EVALUATION 

Performance  evaluation  is  done  by  a program  which  compares  the  phonetic  hypotheses 
to  a hand  transcription.  The  best  path  of  phones  is  matched  with  the  hand  phones  and 
statistics  are  collected.  The  current  version  of  PSYN  produces  the  following  results: 


    total matched             611
    % correct                  29
    % correct 1st choice       23
    # labels                 2089
    extra segments            268
    missing segments           41

APPENDIX: Confusion Matrices


Confusion matrices based on all labels and smaller label sets are also available. Three
phone label subsets are generally used for evaluation purposes:

CL1:

    HIGH=IY,IH,IX,UW,UH
    MID=EY,EH,OW,AX,ER
    LOW=AE,AA,AO
    FRNT=IY,IH,EY,EH,AE
    CNT=IX,AX,ER
    BAK=UW,UH,OW,AO,AA
    DIPH=AY,AW,OY
    LIQ=L,R,EL
    GL=W,Y
    NAS=M,N,NX,EM,EN
    VOST=B,D,G
    VLST=P,T,K,Q
    FLAP=DX
    VOFR=DH,V,WH,Z,ZH
    VLFR=TH,F,S,SH,HH
    GRB=!,&

CL2:

    VOWELS=HIGH,MID,LOW,FRNT,CNT,BAK,DIPH
    RES=NAS,GL,LIQ
    ST=VOST,VLST
    FLAP=FLAP
    FR=VOFR,VLFR
    GRB=GRB
    SIL=SIL

CL3:

    HIGH=IY,IH,IX,UW,UH
    MID=EY,EH,OW,AX,ER
    LOW=AE,AA,AO
    DIPH=AY,AW,OY
    LIQ=L,R,EL
    GL=W,Y
    NAS=M,N,NX,EM,EN
    VOST=B,D,G
    VLST=P,T,K,Q
    FLAP=DX
    VOFR=DH,V,WH,Z,ZH
    VLFR=TH,F,S,SH,HH
    GRB=!,&
    SIL=-
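
Collapsing phone labels into classes before counting confusions can be sketched as follows. Only three of the classes listed above are included, and the aligned pairs are hypothetical; the actual evaluation program and its alignment procedure are not described here.

```python
from collections import Counter

# A few of the class definitions listed above (not the complete sets).
CLASSES = {
    "NAS": ["M", "N", "NX", "EM", "EN"],
    "VOST": ["B", "D", "G"],
    "VLST": ["P", "T", "K", "Q"],
}
PHONE_TO_CLASS = {phone: name for name, phones in CLASSES.items() for phone in phones}

def confusion(pairs):
    """Count (hand-transcribed class, hypothesized class) pairs."""
    return Counter((PHONE_TO_CLASS[hand], PHONE_TO_CLASS[hyp]) for hand, hyp in pairs)

# Hypothetical aligned (hand phone, hypothesized phone) pairs.
matrix = confusion([("M", "N"), ("T", "D"), ("B", "D"), ("N", "N")])
```

At the class level the M/N confusion counts as correct, which is why percentages computed over larger classes are higher than those over individual labels.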

Much of the work done on the phonetic component up until this time has
been in the realm of representation of knowledge and control. The feature space and
pattern matching facilities have been designed in such a way that new rules, features,
and labels can be handled easily. Complete output facilities are also available to aid in
the analysis and formulation of rules. Future work in the phonetic component will be
concentrated on improvement in the current rules and the addition of new rules. It is
expected that the evaluation statistics will rapidly show improved performance as new
rules are put into the system.